Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

Mark Miller Mon, 08 Dec 2008 08:25:31 -0800

I tried a quick poor man's version using a MultiSearcher and wrapping the sub readers as searchers. Other than some AUTO sort field detection problems, all tests do appear to pass. The new sort stuff for MultiSearcher may be a tiny bit off...sort tests fail, though are only slightly off, with that patch. Haven't looked further yet - just hacked it up real quick. Seems to work, but needs work.


- Mark



Michael McCandless wrote:

Mark Miller wrote:
Michael McCandless wrote:
Mark Miller wrote:
What do we get from this though? A MultiSearcher (with the scoring issues) that can properly do rewrite? Won't we have to take MultiSearcher's scoring baggage into this as well?
If this can work, what we'd get is far better reopen() performance
when you sort-by-field, with no change to the returned results
(rewrite, scores, sort order are identical).

Say you have 1MM doc index, and then you add 100 docs & commit.
Today, when you reopen() and then do a search, FieldCache recomputes
from scratch (iterating through all Terms in entire index) the global
arrays for the fields you're sorting on.  The cost is in proportion to
total index size.

With this change, only the new segment's terms will be iterated on, so
the cost is in proportion to what new segments appeared.
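
A minimal sketch of that cost difference, assuming a cache keyed by segment instance. Segment, PerSegmentCache and ReopenDemo are hypothetical stand-ins for illustration, not Lucene's SegmentReader or FieldCache:

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a segment reader; not Lucene's SegmentReader.
final class Segment {
    final int[] values; // one sort value per document in this segment
    Segment(int[] values) { this.values = values; }
}

// Hypothetical per-segment cache, keyed by segment identity.
final class PerSegmentCache {
    private final Map<Segment, int[]> cache = new IdentityHashMap<>();
    private int docsLoaded = 0; // documents visited while filling the cache

    int[] get(Segment seg) {
        int[] cached = cache.get(seg);
        if (cached == null) {
            // Cache miss: stands in for iterating the segment's terms.
            cached = seg.values.clone();
            docsLoaded += cached.length;
            cache.put(seg, cached);
        }
        return cached;
    }

    // Fill the cache for every segment; returns documents loaded by this call.
    int warm(List<Segment> segments) {
        int before = docsLoaded;
        for (Segment s : segments) {
            get(s);
        }
        return docsLoaded - before;
    }
}

final class ReopenDemo {
    public static void main(String[] args) {
        PerSegmentCache cache = new PerSegmentCache();
        List<Segment> reader = new ArrayList<>();
        reader.add(new Segment(new int[1_000_000]));
        System.out.println("first open loaded " + cache.warm(reader));

        // "reopen": old segment instances are shared, one new segment is added.
        List<Segment> reopened = new ArrayList<>(reader);
        reopened.add(new Segment(new int[100]));
        System.out.println("reopen loaded " + cache.warm(reopened));
    }
}
```

Only the 100-document segment misses on the second call, because the 1MM-document segment instance is the same object in both lists.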

This is the same benefit we are seeking with LUCENE-831, for all uses
of FieldCache (not just sort-by-field), it's just that I think we can
achieve this speedup to sort-by-field without LUCENE-831.
Yup, I'm with you on all that. Except the without LUCENE-831 part - we need some FieldCache meddling right? The current FieldCache approach doesn't allow us to meddle much. Isn't it more like, we want the LUCENE-831 API (or something similar), but we won't need the objectarray or merge stuff?
We wouldn't need any change to FieldCache, because we only ask FieldCache for int[] (eg) on the SegmentReader instances. Because reopen() shares SegmentReader instances, only the new segments would have a cache miss in FieldCache. I think?
Once we do LUCENE-831, minus objectarray and merging, this change would be basically the same, ie, accessing per-segment int values, just with a new API. Ie, by doing this change first I don't think we're going to waste much in then cutting over in the future to LUCENE-831's API (vs waiting for LUCENE-831 api).
I think there would be no change to the scoring: we would still create
a Weight based on the toplevel IndexReader, but then search each
sub-reader separately, using that Weight.

Though... that is unusual (to create a Weight with the parent
IndexSearcher and then use it in the sub-searchers) -- will something
break if we do that?  (This is new territory for me).
Okay, right. That does change things. Would love to hear more opinions, but that certainly seems reasonable to me. You score each segment using tf/idf stats from all of the segments.
That's my expectation (hope). So the results are identical but performance is much better.
If something will break, I think we can still achieve this, but it
will be a more invasive change and probably will have to be re-coupled
to the new API we will introduce with LUCENE-831.  Marvin actually
referred to how to do this, here:
https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650854#action_12650854
in the paragraph starting with "If our goal is minimal impact...".
Basically during collection, the FieldSortedHitQueue would have to
keep track of subReaderIndex/subReaderDocID (mapping, through
iteration, from the primary docID w/o doing a wasteful new binary
search for each) and enroll into different pqueues indexed by
subReaderIndex, then do the merge sort in the end.
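
The docID mapping step described above might look roughly like this. It is a sketch only: SubReaderCursor and the starts array are hypothetical names, and the code assumes the collector sees doc IDs in non-decreasing order and never sees one at or beyond the last entry of starts:

```java
// Hypothetical helper: maps a top-level doc ID to (subReaderIndex, subReaderDocID)
// by walking forward, instead of doing a binary search for every hit.
final class SubReaderCursor {
    // starts[i] is the first top-level doc ID of sub-reader i;
    // the final entry is the total document count (maxDoc).
    private final int[] starts;
    private int index = 0;

    SubReaderCursor(int[] starts) {
        this.starts = starts;
    }

    // Returns the index of the sub-reader that holds globalDoc.
    // Doc IDs must arrive in non-decreasing order and be below maxDoc.
    int advance(int globalDoc) {
        while (globalDoc >= starts[index + 1]) {
            index++;
        }
        return index;
    }

    // Doc ID relative to the current sub-reader; call after advance(globalDoc).
    int localDoc(int globalDoc) {
        return globalDoc - starts[index];
    }
}
```

Each hit would then be enrolled in the queue for the index returned by advance, and the per-sub-reader queues merged at the end; the sketch leaves those two steps out.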

Mike
Michael McCandless wrote:
On thinking more about this... I think with a few small changes we
could achieve Sort by field without materializing a full array.  We
can decouple this change from LUCENE-831.

I think all that's needed is:

* Expose sub-readers (LUCENE-1475) by adding IndexReader[]
  IndexReader.getSubReaders.  Default impl could just return
  length-1 array of itself.

* Change the IndexSearcher.search that takes a Sort, to first call
  IndexReader.getSubReaders, and then do the same logic that
  MultiSearcher does, with improvements from LUCENE-1471 (run
  separate search per-reader, then merge-sort the top hits from
  each).
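
The second step can be sketched as follows, under simplifying assumptions: each sub-reader is reduced to an int[] of sort values, every document matches, and the per-reader top hits are combined with a plain sort standing in for a real merge. Hit and PerSegmentSort are hypothetical names, not Lucene classes:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical hit record; not a Lucene class.
final class Hit {
    final int doc; // top-level doc ID (sub-reader doc ID + doc base)
    final int key; // sort-field value
    Hit(int doc, int key) { this.doc = doc; this.key = key; }
}

final class PerSegmentSort {
    // Order by sort key, breaking ties by top-level doc ID.
    static final Comparator<Hit> ORDER =
        Comparator.<Hit>comparingInt(h -> h.key).thenComparingInt(h -> h.doc);

    // Top-n hits of one sub-reader; keys[i] is the sort value of its doc i.
    static List<Hit> topN(int[] keys, int docBase, int n) {
        // Max-heap on ORDER, so the worst hit kept so far is evicted first.
        PriorityQueue<Hit> pq = new PriorityQueue<>(ORDER.reversed());
        for (int i = 0; i < keys.length; i++) {
            pq.add(new Hit(docBase + i, keys[i]));
            if (pq.size() > n) {
                pq.poll();
            }
        }
        List<Hit> out = new ArrayList<>(pq);
        out.sort(ORDER);
        return out;
    }

    // Search each sub-reader separately, then combine the per-reader top hits.
    static List<Hit> search(int[][] subReaders, int n) {
        List<Hit> all = new ArrayList<>();
        int docBase = 0;
        for (int[] keys : subReaders) {
            all.addAll(topN(keys, docBase, n));
            docBase += keys.length;
        }
        all.sort(ORDER); // stand-in for a k-way merge of the sorted lists
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }
}
```

Because every sub-reader contributes its own top n, the global top n is always contained in the combined list, which is why the result matches a sort over the whole index.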

The results should be functionally identical to what we have today,
but, searching after doing a reopen() should be much faster since we'd
no longer re-build the global FieldCache array.

Does this make sense?  It's a small change for a big win, I think.
Does anyone want to take a crack at this patch?

Mike

Mark Miller wrote:
Michael McCandless wrote:
I'd like to decouple "upgraded to Object" vs "materialize full array", ie, so we can access native values w/o materializing the full array. I also think "upgrade to Object" is dangerous to even offer since it's so costly.
I'm right with you. I didn't think the Object approach was really an upgrade (beyond losing the merge, which is especially important for StringIndex - it has no merge option at the moment) which is why I left both options for now. So I def agree we need to move to iterator, drop object, etc.
It's the doin' that ain't so easy. The iterator approach seems somewhat straightforward (though it's complicated by needing to provide a random access object as well), but I'm still working through how we control so many iterator types (I don't see how you can use polymorphism yet).
- Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
