Great, thanks for the explanation Adrien! I still learn something new about Lucene every day ;)
Mike McCandless http://blog.mikemccandless.com On Mon, May 24, 2021 at 1:38 PM Adrien Grand <jpou...@gmail.com> wrote: > Indeed collectors keep state about the minimum required score for a hit to > be competitive, typically the next float after the score of the bottom of > the priority queue. It can only go up as more documents/segments get > collected. Collectors have a feedback loop with scorers that enables > scorers to skip hits that cannot possibly make it to the top hits. > > If you can put segments that produce higher scores first, you will get > better performance. I don't think it can be easily done when sorting by > score, but when sorting by a field that correlates with index order, it > certainly can. This is the point of LUCENE-9507. > > With multiple slices/threads, every slice gets its own collector, but > LUCENE-8978 made collectors able to share information about their minimum > competitive scores across threads so that seeing good hits on a segment > could still help skip hits on other segments like when searching without a > threadpool. > > A possibly surprising consequence is that running the same query multiple > times could return different hit count estimations when IndexSearcher is > configured with a threadpool. > > > Le lun. 24 mai 2021 à 15:55, Michael McCandless <luc...@mikemccandless.com> > a écrit : > >> OK, it sounds like this is indeed expected (thanks Adrien, Greg, Mike, >> Patrick!), that the estimated hit count (GREATER_THAN_OR_EQUAL_TO) can >> indeed change if the segment order is different but otherwise identical >> documents are in each segment (yes, with different global doc ids, so the >> true identical document case would tie break by those). >> >> That was/is surprising to me -- I thought segment order did not alter >> anything at search time (except the tie-break-by-global-docid). I thought >> each segment was searched, independently, by itself, and then results >> "merged" together. But this has become more complex/interesting with time >> :) >> >> So is this because BMW will use the best results from previous segments >> to optimize/improve skipping while searching future segments? I.e. some >> sort of state carries over from having searched past segments, to better >> optimize searching the next segment? This makes total sense (that we would >> optimize this way). I guess this also means that concurrent search ("thread >> per slice/segment") would not be able to optimize with BMW as well? >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> >> On Sun, May 23, 2021 at 8:24 AM Michael Sokolov <msoko...@gmail.com> >> wrote: >> >>> Agree with Adrien and Greg, approximate hit counts may vary. That global >>> docid tiebreaking may affect the actual results is an interesting >>> observation too, thanks Adrien. It seems to me the best way forward is to >>> relax the check. Depending on how the index is constructed the difference >>> could be quite large. Eg if the index is sorted and all the best documents >>> are in one segment, we could end up with quite different counts; they could >>> get many multiples different if you collect single threaded. >>> >>> On Fri, May 21, 2021, 2:22 PM Greg Miller <gsmil...@gmail.com> wrote: >>> >>>> Hi Patrick- >>>> >>>> Interesting finds and deep dive! >>>> >>>> Just to confirm, in the cases of differing hit counts that you've >>>> observed, these are TotalHits with a Relation of >>>> "GREATER_THAN_OR_EQUAL_TO" right? Never cases where the Relation is >>>> "EQUAL_TO"? I ask because that would change my opinion of >>>> whether-or-not this is concerning. If we're proving a hit count with >>>> "equal to" semantics, it should be correct and shouldn't change based >>>> on segment ordering, etc. But, on the other hand, if we're just >>>> providing a floor (i.e., "there are at least this many hits"), then >>>> I'm not particularly concerned by this (assuming that floor is >>>> actually correct in call cases and there aren't fewer hits). >>>> >>>> Cheers, >>>> -Greg >>>> >>>> On Fri, May 21, 2021 at 2:25 AM Adrien Grand <jpou...@gmail.com> wrote: >>>> > >>>> > Hi Patrick, >>>> > >>>> > >>>> > Why do you feel weird about the fact that segment order impacts the >>>> hit count estimation? It feels ok to me, especially as segment order has >>>> deeper implications, e.g. you could get different top hits given that >>>> Lucene uses the global doc ID as a tie breaker for documents that produce >>>> the same score. >>>> > >>>> > Could IndexRearranger write segments in deterministic order in the >>>> segments file to improve reproducibility? Or could your application >>>> configure leaf order explicitly? (LUCENE-9507) >>>> > >>>> > Le ven. 21 mai 2021 à 08:29, Patrick Zhai <zhai7...@gmail.com> a >>>> écrit : >>>> >> >>>> >> Hi folks >>>> >> >>>> >> For the past few weeks I've been working with Mike McCandless to use >>>> the recent introduced IndexRearranger to replace the old way of guarantee >>>> deterministic index -- using a single index thread and a LogDocMergePolicy. >>>> >> >>>> >> In the progress we found out that with two concurrently built but >>>> rearranged indexes, the estimation hit count will show a small difference. >>>> I've carefully checked the index and found they're almost the same but the >>>> segment order is different (index 1 might be segment 1,2,3,4,5 while index >>>> 2 might be segment 2,1,3,5,4 where nth segment contains exactly the same >>>> documents and sorted using the same criteria). So I suspected the segment >>>> order impacted the hit count estimation and to confirm that I turned off >>>> the concurrency of rearranger so that it will always create segments in >>>> order. The result proved my theory that the segment order was impacting the >>>> hit count estimation. >>>> >> >>>> >> Later on I did some investigation and found in TopScoreDocCollector >>>> we do have logic of updating the global minScore so I guess that's where >>>> makes the difference. Mike and I both feel a little weird that segment >>>> order will affect the hit count estimation, so just want to >>>> >> 1. See whether there's any chance we could improve the API or >>>> documentation >>>> >> 2. Seek some advice on how should we tackle the problem, obviously >>>> we don't want rearranger to execute on only 1 thread (since we use it for >>>> speed!), currently what we're considering is to relax the check for hit >>>> count estimation, but maybe there's a better way? >>>> >> >>>> >> Best >>>> >> Patrick >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: dev-h...@lucene.apache.org >>>> >>>>