[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-18 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Attachment: LUCENE-3102-nowrap.patch

Patch against 3x:
* Adds a create() to CachingCollector which does not take a Collector to wrap. 
Internally, it creates a no-op Collector which ignores everything (sketched 
below).
* Javadocs for create().
* Matching test.
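For illustration, a minimal sketch of the no-op Collector against the 3.x 
Collector API (the class name NoOpCollector is my placeholder, not necessarily 
what the patch uses):

{code:java}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Sketch: a Collector that accepts everything and ignores it, used as the
// delegate when create() is called without a Collector to wrap.
final class NoOpCollector extends Collector {
  @Override
  public void setScorer(Scorer scorer) throws IOException {}

  @Override
  public void collect(int doc) throws IOException {}

  @Override
  public void setNextReader(IndexReader reader, int docBase) throws IOException {}

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true; // any order is fine, since collected docs are ignored
  }
}
{code}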

 Few issues with CachingCollector
 

 Key: LUCENE-3102
 URL: https://issues.apache.org/jira/browse/LUCENE-3102
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3102-factory.patch, LUCENE-3102-nowrap.patch, 
 LUCENE-3102.patch, LUCENE-3102.patch


 CachingCollector (introduced in LUCENE-1421) has a few issues:
 # Since the wrapped Collector may support out-of-order collection, the 
 document IDs cached may be out-of-order (depending on the Query), and thus 
 replay(Collector) will forward document IDs out-of-order to a Collector that 
 may not support it (see the sketch after this list).
 # It does not clear cachedScores + cachedSegs upon exceeding RAM limits.
 # I think that instead of comparing curScores to null to determine whether 
 scores are requested, we should have a dedicated boolean - for clarity.
 # Can the check if (base + nextLength > maxDocsToCache) (line 168) be 
 relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
 maxDocsToCache constraint, but if it were 10K I could? Wouldn't we still want 
 to try and cache them?
 Also:
 * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
 need that if CachingCollector ctor already takes a boolean cacheScores? I 
 think it's better defined explicitly than implicitly?
 * Let's introduce a factory method for creating a specialized version 
 depending on whether scoring is requested (i.e., implement the TODO in line 189)
 * I think it's a useful collector, which stands on its own and not specific 
 to grouping. Can we move it to core?
 * How about using OpenBitSet instead of int[] for doc IDs?
 ** If the number of hits is big, we'd gain some RAM back, and be able to 
 cache more entries
 ** NOTE: OpenBitSet can be used for in-order collection only, so we can 
 use it if the wrapped Collector does not support out-of-order collection
 * Do you think we can modify this Collector to not necessarily wrap another 
 Collector? We have such a Collector, which stores (in-memory) all matching doc 
 IDs + scores (if required). Those are later fed into several processes that 
 operate on them (e.g. fetch more info from the index etc.). I am thinking we 
 can make CachingCollector *optionally* wrap another Collector, and then 
 someone can reuse it by setting RAM limit to unlimited (we should have a 
 constant for that) in order to simply collect all matching docs + scores.
 * I think a set of dedicated unit tests for this class alone would be good.
 That's it so far. Perhaps, if we do all of the above, more things will pop up.
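To make item #1 concrete: replay would have to guard against forwarding an 
out-of-order cache into an in-order Collector, along these lines (an 
illustrative sketch, not the actual fix; cachedDocs and cacheIsOutOfOrder 
stand in for CachingCollector's internal state):

{code:java}
// Illustrative sketch of a replay-order guard (not the committed code).
void replay(Collector other) throws IOException {
  if (cacheIsOutOfOrder && !other.acceptsDocsOutOfOrder()) {
    throw new IllegalArgumentException(
        "cannot replay an out-of-order cache into an in-order Collector");
  }
  for (int doc : cachedDocs) {
    other.collect(doc); // per-segment setNextReader/docBase handling omitted
  }
}
{code}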

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-18 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Attachment: LUCENE-3102-nowrap.patch

Patch adds randomization to TestGrouping and fixes the CHANGES typo.

Mike, TestGrouping fails w/ this seed: 
-Dtests.seed=7295196064099074191:-1632255311098421589 (it picks a non-wrapping 
collector).

I guess I didn't insert the randomization properly. It's the only place where 
the test creates a CachingCollector, though. I noticed that it fails in the 
'doCache' but '!doAllGroups' case.

Can you please take a look? I'm not familiar with this test, and cannot debug 
it anymore today.
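(For context, the randomization presumably amounts to something like this 
hypothetical sketch, using LuceneTestCase's random and the create() overloads 
from this issue; TestGrouping's actual code differs:)

{code:java}
// Hypothetical sketch of randomly exercising the no-wrap path in a test.
final boolean cacheScores = random.nextBoolean();
final CachingCollector cachedCollector;
if (random.nextBoolean()) {
  // Wrap the real first-pass collector.
  cachedCollector = CachingCollector.create(firstPassCollector, cacheScores, 64.0);
} else {
  // No wrapping: cache only, replay into the first-pass collector afterwards.
  cachedCollector = CachingCollector.create(true, cacheScores, 64.0);
}
{code}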




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-18 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3102:
---

Attachment: LUCENE-3102.patch

Patch.

I think I fixed TestGrouping to exercise the no-wrapped-collector and 
replay-twice cases for CachingCollector.
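A rough sketch of that pattern (illustrative; the collector names are 
placeholders and the real test is more involved):

{code:java}
// Collect once, then replay the same cache into two different collectors.
CachingCollector cache = CachingCollector.create(firstPassCollector, true, 64.0);
searcher.search(query, cache);
if (cache.isCached()) {
  cache.replay(allGroupsCollector);   // first replay
  cache.replay(secondPassCollector);  // replaying twice must also work
}
{code}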




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-17 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Attachment: LUCENE-3102-factory.patch

Patch against 3x which:

* Adds a factory method to CachingCollector, specializing on cacheScores (see 
the sketch below)
* Clarifies the Collector.needsScores() TODO

There are two remaining issues; let's address them after we iterate on this 
patch.
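The specialization presumably takes roughly this shape (a sketch of the 
factory idea; the subclass names are my guesses, not necessarily the patch's):

{code:java}
// Sketch of a factory specializing on cacheScores (subclass names guessed).
public static CachingCollector create(Collector other, boolean cacheScores,
                                      double maxRAMMB) {
  return cacheScores
      ? new ScoreCachingCollector(other, maxRAMMB)    // caches doc IDs + scores
      : new NoScoreCachingCollector(other, maxRAMMB); // caches doc IDs only
}
{code}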




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Lucene Fields: [New, Patch Available]  (was: [New])




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Attachment: LUCENE-3102.patch

Patch includes the bug fixes + test, but still none of the items I listed 
after 'Also:'. I plan to tackle those next, in subsequent patches.

Question -- perhaps we can commit these changes incrementally? I.e., after we 
iterate on the changes in this patch, if they are ok, commit them, then do the 
rest of the stuff? Or is a single commit w/ everything preferable?

Mike, there is another reason to separate Collector.needsScores() from 
cacheScores -- it is possible someone will pass a Collector which needs scores 
but won't want CachingCollector to cache them. In that case, the wrapped 
Collector should be delegated setScorer instead of cachedScorer (see the 
sketch below).

I will leave Collector.needsScores() for a separate issue, though.
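In code, the distinction would look roughly like this sketch (field names 
follow this discussion, not the committed change):

{code:java}
// Sketch: a wrapped Collector may need scores even when we don't cache them.
@Override
public void setScorer(Scorer scorer) throws IOException {
  if (cacheScores) {
    this.scorer = scorer;          // scores are read from here into the cache
    other.setScorer(cachedScorer); // delegate reads the cached score
  } else {
    other.setScorer(scorer);       // delegate gets the live Scorer directly
  }
}
{code}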




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Attachment: LUCENE-3102.patch

bq. Only thing is: I would be careful about directly setting those private 
fields of the cachedScorer; I think (not sure) this incurs an access check on 
each assignment. Maybe make them package protected? Or use a setter?

Good catch Mike. I read about it a bit and found a nice webpage which explains 
the implications (http://www.glenmccl.com/jperf/). Indeed, if the member is 
private (whether it's in the inner or outer class), there is an access check. 
So the right thing to do is to declare it protected / package-private, which I 
did. Thanks for the opportunity to get some education!

Patch fixes this. I intend to commit it shortly, move the class to core, and 
apply to trunk. Then I'll continue w/ the rest of the improvements.
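As a general Java illustration of the issue (example names, not the patch's 
code): when outer-class code touches a private member of a nested class, javac 
generates a synthetic accessor method for each access; package-private 
visibility allows a direct field access instead.

{code:java}
// General illustration of the synthetic-accessor cost (example names only).
final class Outer {
  static final class CachedScorer {
    // private float score;  // outer-class writes would go through a
    //                       // compiler-generated synthetic accessor method
    float score;             // package-private: direct field access
  }

  private final CachedScorer cachedScorer = new CachedScorer();

  void onCollect(float s) {
    cachedScorer.score = s;  // direct store, no accessor indirection
  }
}
{code}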




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Component/s: (was: contrib/*)
 modules/grouping




[jira] [Updated] (LUCENE-3102) Few issues with CachingCollector

2011-05-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-3102:
---

Component/s: (was: modules/grouping)
 core/search
