[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Shai Erera (JIRA) Tue, 17 May 2011 23:53:00 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035223#comment-13035223
 ]


Shai Erera commented on LUCENE-3102:
------------------------------------

There are two things left to do:

(1) Use bit set instead of int[] for docIDs. If we do this, then it means the 
Collector cannot support out-of-order collections (which is not a big deal 
IMO). It also means for large indexes, we might consume more RAM than int[].

(2) Allow this Collector to stand on its own, w/o necessarily wrapping another 
Collector. There are several ways we can achieve that:
* Take a 'null' Collector and check other != null. Adds an 'if' but not a big 
deal IMO. Also, acceptDocsOutOfOrder will have to either return false (or 
true), or we take that as a parameter.
* Take a 'null' Collector and set this.other to a private static instance of a 
NoOpCollector. We'll still be delegating calls to it, but hopefully it won't be 
expensive. Same issue w/ out-of-order
* Create two specialized variants of CachingCollector.

Personally I'm not too much in favor of the last option - too much code dup for 
not much gain.

The option I like the most is the 2nd (introducing a NoOpCollector). We can 
even introduce it as a public static member of CachingCollector and let users 
decide if they want to use it or not. For ease of use, we can still allow 
'null' to be passed to create().

What do you think?

> Few issues with CachingCollector
> --------------------------------
>
>                 Key: LUCENE-3102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3102
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3102-factory.patch, LUCENE-3102.patch, 
> LUCENE-3102.patch
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the 
> document IDs cached may be out-of-order (depends on the Query) and thus 
> replay(Collector) will forward document IDs out-of-order to a Collector that 
> may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine 
> if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be 
> relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the 
> maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want 
> to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we 
> need that if CachingCollector ctor already takes a boolean "cacheScores"? I 
> think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if 
> scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific 
> to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to 
> cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can 
> use that if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another 
> Collector? We have such Collector which stores (in-memory) all matching doc 
> IDs + scores (if required). Those are later fed into several processes that 
> operate on them (e.g. fetch more info from the index etc.). I am thinking, we 
> can make CachingCollector *optionally* wrap another Collector and then 
> someone can reuse it by setting RAM limit to unlimited (we should have a 
> constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-3102) Few issues with CachingCollector

Reply via email to