Great, thanks for the explanation Adrien!  I still learn something new
about Lucene every day ;)

Mike McCandless

http://blog.mikemccandless.com


On Mon, May 24, 2021 at 1:38 PM Adrien Grand <jpou...@gmail.com> wrote:

> Indeed collectors keep state about the minimum required score for a hit to
> be competitive, typically the next float after the score of the bottom of
> the priority queue. It can only go up as more documents/segments get
> collected. Collectors have a feedback loop with scorers that enables
> scorers to skip hits that cannot possibly make it to the top hits.
>
> If you can put segments that produce higher scores first, you will get
> better performance. I don't think it can be easily done when sorting by
> score, but when sorting by a field that correlates with index order, it
> certainly can. This is the point of LUCENE-9507.
>
> With multiple slices/threads, every slice gets its own collector, but
> LUCENE-8978 made collectors able to share information about their minimum
> competitive scores across threads so that seeing good hits on a segment
> could still help skip hits on other segments like when searching without a
> threadpool.
>
> A possibly surprising consequence is that running the same query multiple
> times could return different hit count estimations when IndexSearcher is
> configured with a threadpool.
>
>
> Le lun. 24 mai 2021 à 15:55, Michael McCandless <luc...@mikemccandless.com>
> a écrit :
>
>> OK, it sounds like this is indeed expected (thanks Adrien, Greg, Mike,
>> Patrick!), that the estimated hit count (GREATER_THAN_OR_EQUAL_TO) can
>> indeed change if the segment order is different but otherwise identical
>> documents are in each segment (yes, with different global doc ids, so the
>> true identical document case would tie break by those).
>>
>> That was/is surprising to me -- I thought segment order did not alter
>> anything at search time (except the tie-break-by-global-docid).  I thought
>> each segment was searched, independently, by itself, and then results
>> "merged" together.  But this has become more complex/interesting with time
>> :)
>>
>> So is this because BMW will use the best results from previous segments
>> to optimize/improve skipping while searching future segments?  I.e. some
>> sort of state carries over from having searched past segments, to better
>> optimize searching the next segment?  This makes total sense (that we would
>> optimize this way). I guess this also means that concurrent search ("thread
>> per slice/segment") would not be able to optimize with BMW as well?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Sun, May 23, 2021 at 8:24 AM Michael Sokolov <msoko...@gmail.com>
>> wrote:
>>
>>> Agree with Adrien and Greg, approximate hit counts may vary. That global
>>> docid tiebreaking may affect the actual results is an interesting
>>> observation too, thanks Adrien. It seems to me the best way forward is to
>>> relax the check. Depending on how the index is constructed the difference
>>> could be quite large. Eg if the index is sorted and all the best documents
>>> are in one segment, we could end up with quite different counts; they could
>>> get many multiples different if you collect single threaded.
>>>
>>> On Fri, May 21, 2021, 2:22 PM Greg Miller <gsmil...@gmail.com> wrote:
>>>
>>>> Hi Patrick-
>>>>
>>>> Interesting finds and deep dive!
>>>>
>>>> Just to confirm, in the cases of differing hit counts that you've
>>>> observed, these are TotalHits with a Relation of
>>>> "GREATER_THAN_OR_EQUAL_TO" right? Never cases where the Relation is
>>>> "EQUAL_TO"? I ask because that would change my opinion of
>>>> whether-or-not this is concerning. If we're proving a hit count with
>>>> "equal to" semantics, it should be correct and shouldn't change based
>>>> on segment ordering, etc. But, on the other hand, if we're just
>>>> providing a floor (i.e., "there are at least this many hits"), then
>>>> I'm not particularly concerned by this (assuming that floor is
>>>> actually correct in call cases and there aren't fewer hits).
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>> On Fri, May 21, 2021 at 2:25 AM Adrien Grand <jpou...@gmail.com> wrote:
>>>> >
>>>> > Hi Patrick,
>>>> >
>>>> >
>>>> > Why do you feel weird about the fact that segment order impacts the
>>>> hit count estimation? It feels ok to me, especially as segment order has
>>>> deeper implications, e.g. you could get different top hits given that
>>>> Lucene uses the global doc ID as a tie breaker for documents that produce
>>>> the same score.
>>>> >
>>>> > Could IndexRearranger write segments in deterministic order in the
>>>> segments file to improve reproducibility? Or could your application
>>>> configure leaf order explicitly? (LUCENE-9507)
>>>> >
>>>> > Le ven. 21 mai 2021 à 08:29, Patrick Zhai <zhai7...@gmail.com> a
>>>> écrit :
>>>> >>
>>>> >> Hi folks
>>>> >>
>>>> >> For the past few weeks I've been working with Mike McCandless to use
>>>> the recent introduced IndexRearranger to replace the old way of guarantee
>>>> deterministic index -- using a single index thread and a LogDocMergePolicy.
>>>> >>
>>>> >> In the progress we found out that with two concurrently built but
>>>> rearranged indexes, the estimation hit count will show a small difference.
>>>> I've carefully checked the index and found they're almost the same but the
>>>> segment order is different (index 1 might be segment 1,2,3,4,5 while index
>>>> 2 might be segment 2,1,3,5,4 where nth segment contains exactly the same
>>>> documents and sorted using the same criteria). So I suspected the segment
>>>> order impacted the hit count estimation and to confirm that I turned off
>>>> the concurrency of rearranger so that it will always create segments in
>>>> order. The result proved my theory that the segment order was impacting the
>>>> hit count estimation.
>>>> >>
>>>> >> Later on I did some investigation and found in TopScoreDocCollector
>>>> we do have logic of updating the global minScore so I guess that's where
>>>> makes the difference. Mike and I both feel a little weird that segment
>>>> order will affect the hit count estimation, so just want to
>>>> >> 1. See whether there's any chance we could improve the API or
>>>> documentation
>>>> >> 2. Seek some advice on how should we tackle the problem, obviously
>>>> we don't want rearranger to execute on only 1 thread (since we use it for
>>>> speed!), currently what we're considering is to relax the check for hit
>>>> count estimation, but maybe there's a better way?
>>>> >>
>>>> >> Best
>>>> >> Patrick
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>
>>>>

Reply via email to