[
https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033784#comment-13033784
]
Michael McCandless commented on LUCENE-3102:
--------------------------------------------
Great catches Shai -- thanks for the thorough review!
bq. Since the wrapped Collector may support out-of-order collection, the
document IDs cached may be out-of-order (depends on the Query) and thus
replay(Collector) will forward document IDs out-of-order to a Collector that
may not support it.
Ahh... maybe we throw IllegalArgumentException if the replayed collector
requires in-order docs but the first-pass collector didn't guarantee in-order
collection?
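Something like this sketch, maybe (class and method names here are made up for illustration, not the real CachingCollector code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: cache doc IDs on the first pass, remember
// whether they arrived in order, and have replay() reject a target
// collector that requires in-order docs when the cache is out of order.
class CachingCollectorSketch {
    private final List<Integer> cachedDocs = new ArrayList<>();
    private boolean cacheIsInOrder = true;
    private int lastDoc = -1;

    // First pass: cache each doc ID and track ordering as we go.
    void collect(int doc) {
        if (doc < lastDoc) {
            cacheIsInOrder = false;
        }
        lastDoc = doc;
        cachedDocs.add(doc);
    }

    // Stand-in for the replay target's Collector interface.
    interface Target {
        boolean acceptsDocsOutOfOrder();
        void collect(int doc);
    }

    void replay(Target target) {
        if (!cacheIsInOrder && !target.acceptsDocsOutOfOrder()) {
            throw new IllegalArgumentException(
                "cached doc IDs are out of order, but the replay target requires in-order collection");
        }
        for (int doc : cachedDocs) {
            target.collect(doc);
        }
    }
}
```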
bq. It does not clear cachedScores + cachedSegs upon exceeding RAM limits
Hmm, I think it does clear cachedScores? (But not cachedSegs).
bq. I think that instead of comparing curScores to null, in order to determine
if scores are requested, we should have a specific boolean - for clarity
That sounds great!
bq. This check "if (base + nextLength > maxDocsToCache)" (line 168) can be
relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the
maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to
try and cache them?
Oh you mean for the last "chunk" we could alloc right up to the limit?
Good!
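Ie, something along these lines for sizing the final chunk (the names base/nextLength/maxDocsToCache come from the discussion; the method itself is just a sketch, not the actual code at line 168):

```java
// Sketch of "alloc right up to the limit": instead of giving up when
// base + nextLength > maxDocsToCache, shrink the final chunk to exactly
// the remaining budget so we still cache as much as we can.
class ChunkSizer {
    static int lastChunkSize(int base, int nextLength, int maxDocsToCache) {
        if (base >= maxDocsToCache) {
            return 0; // budget already exhausted, stop caching
        }
        // Cap the chunk at whatever budget remains rather than rejecting it.
        return Math.min(nextLength, maxDocsToCache - base);
    }
}
```

So a 512K segment against a 10K budget would still get a 10K chunk instead of nothing.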
bq. The TODO in line 64 (having Collector specify needsScores()) – why do we
need that if CachingCollector ctor already takes a boolean "cacheScores"? I
think it's better defined explicitly than implicitly?
Yes, I think we should keep the explicit boolean (cacheScores), but eg
you could mess up (pass cacheScores = false but then pass a collector
that calls .score()) -- that's why I added the TODO. Ie, it'd be nice
if we could "verify" that the collector agrees we don't need scores.
I think there were other places in Lucene where knowing this up front
could help us... can't remember the details.
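One cheap way to "verify" would be to hand the wrapped collector a guarded scorer when cacheScores=false, so a mismatched collector fails fast instead of silently reading uncached scores. Rough sketch (types are stubs, not the real Scorer/Collector classes):

```java
// Sketch: when cacheScores == false, wrap the real scorer so any call
// to score() throws, surfacing a cacheScores/collector mismatch early.
class ScoreGuard {
    interface Scorer {
        float score();
    }

    static Scorer guard(Scorer real, boolean cacheScores) {
        if (cacheScores) {
            return real; // scores are cached, pass through unchanged
        }
        return () -> {
            throw new IllegalStateException(
                "collector called score() but cacheScores=false");
        };
    }
}
```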
bq. Let's introduce a factory method for creating a specialized version if
scoring is requested / not (i.e., impl the TODO in line 189)
+1
bq. I think it's a useful collector, which stands on its own and not specific
to grouping. Can we move it to core?
+1
{quote}
How about using OpenBitSet instead of int[] for doc IDs?
If the number of hits is big, we'd gain some RAM back, and be able to cache
more entries
NOTE: OpenBitSet can only be used for in-order collection. So we can use
that if the wrapped Collector does not support out-of-order collection
{quote}
Hmm but if the number of hits is small we spend un-needed RAM/CPU,
but, then that tradeoff is maybe OK? I'm just worried about indices
w/ lots of docs... we could also "upgrade" to a bit set part way
through, since it's so clear where the cutoff is.
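Ie, an int[] costs 32 bits per hit while a bit set costs maxDoc bits total, so the cutoff is maxDoc/32 hits. Rough sketch of the upgrade (using java.util.BitSet as a stand-in for OpenBitSet to keep it self-contained; only valid for in-order collection, as noted):

```java
import java.util.Arrays;
import java.util.BitSet;

// Sketch: start with an int[] (cheap for few hits), and once the hit
// count passes maxDoc/32 -- where the bit-set encoding becomes smaller --
// copy the cached IDs into a bit set and drop the array.
class UpgradingDocIdCache {
    private final int maxDoc;
    private int[] docs = new int[16];
    private int count = 0;
    private BitSet bits; // null until we upgrade

    UpgradingDocIdCache(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    void collect(int doc) {
        if (bits != null) {
            bits.set(doc);
            return;
        }
        if (count == docs.length) {
            docs = Arrays.copyOf(docs, docs.length * 2);
        }
        docs[count++] = doc;
        // Cutoff: past maxDoc/32 hits, the bit set uses less RAM.
        if (count > maxDoc / 32) {
            bits = new BitSet(maxDoc);
            for (int i = 0; i < count; i++) {
                bits.set(docs[i]);
            }
            docs = null; // reclaim the int[]
        }
    }

    boolean usingBitSet() {
        return bits != null;
    }

    int cardinality() {
        return bits != null ? bits.cardinality() : count;
    }
}
```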
bq. Do you think we can modify this Collector to not necessarily wrap another
Collector? We have such Collector which stores (in-memory) all matching doc IDs
+ scores (if required). Those are later fed into several processes that operate
on them (e.g. fetch more info from the index etc.). I am thinking, we can make
CachingCollector optionally wrap another Collector and then someone can reuse
it by setting RAM limit to unlimited (we should have a constant for that) in
order to simply collect all matching docs + scores.
I'd actually rather not have the constant -- ie, I don't want to make
it easy to be unlimited? It seems too dangerous... I'd rather your
code has to spell out 10*1024 so you realize you're saying 10 GB (for
example).
bq. I think a set of dedicated unit tests for this class alone would be good.
+1
Awesome feedback!! Are you planning to work up a patch for these...?
> Few issues with CachingCollector
> --------------------------------
>
> Key: LUCENE-3102
> URL: https://issues.apache.org/jira/browse/LUCENE-3102
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/*
> Reporter: Shai Erera
> Priority: Minor
> Fix For: 3.2, 4.0
>
>
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the
> document IDs cached may be out-of-order (depends on the Query) and thus
> replay(Collector) will forward document IDs out-of-order to a Collector that
> may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine
> if scores are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be
> relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the
> maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want
> to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we
> need that if CachingCollector ctor already takes a boolean "cacheScores"? I
> think it's better defined explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if
> scoring is requested / not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific
> to grouping. Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to
> cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection. So we can
> use that if the wrapped Collector does not support out-of-order collection
> * Do you think we can modify this Collector to not necessarily wrap another
> Collector? We have such Collector which stores (in-memory) all matching doc
> IDs + scores (if required). Those are later fed into several processes that
> operate on them (e.g. fetch more info from the index etc.). I am thinking, we
> can make CachingCollector *optionally* wrap another Collector and then
> someone can reuse it by setting RAM limit to unlimited (we should have a
> constant for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]