Hi Patrick- Interesting finds and deep dive!
Just to confirm, in the cases of differing hit counts that you've observed, these are TotalHits with a Relation of "GREATER_THAN_OR_EQUAL_TO", right? Never cases where the Relation is "EQUAL_TO"?

I ask because that would change my opinion of whether this is concerning. If we're providing a hit count with "equal to" semantics, it should be correct and shouldn't change based on segment ordering, etc. On the other hand, if we're just providing a floor (i.e., "there are at least this many hits"), then I'm not particularly concerned by this (assuming that floor is actually correct in all cases and there aren't fewer hits).

Cheers,
-Greg

On Fri, May 21, 2021 at 2:25 AM Adrien Grand <jpou...@gmail.com> wrote:
>
> Hi Patrick,
>
> Why do you feel weird about the fact that segment order impacts the hit count
> estimation? It feels ok to me, especially as segment order has deeper
> implications, e.g. you could get different top hits given that Lucene uses
> the global doc ID as a tie breaker for documents that produce the same score.
>
> Could IndexRearranger write segments in deterministic order in the segments
> file to improve reproducibility? Or could your application configure leaf
> order explicitly? (LUCENE-9507)
>
> On Fri, May 21, 2021 at 08:29, Patrick Zhai <zhai7...@gmail.com> wrote:
>>
>> Hi folks
>>
>> For the past few weeks I've been working with Mike McCandless to use the
>> recently introduced IndexRearranger to replace the old way of guaranteeing a
>> deterministic index -- using a single index thread and a LogDocMergePolicy.
>>
>> In the process we found out that with two concurrently built but rearranged
>> indexes, the estimated hit count will show a small difference. I've
>> carefully checked the indexes and found they're almost the same but the
>> segment order is different (index 1 might be segment 1,2,3,4,5 while index 2
>> might be segment 2,1,3,5,4 where the nth segment contains exactly the same
>> documents and is sorted using the same criteria). So I suspected the segment
>> order impacted the hit count estimation, and to confirm that I turned off the
>> concurrency of the rearranger so that it will always create segments in order.
>> The result confirmed my theory that the segment order was impacting the hit
>> count estimation.
>>
>> Later on I did some investigation and found that in TopScoreDocCollector we do
>> have logic for updating the global minScore, so I guess that's what makes the
>> difference. Mike and I both feel a little weird that segment order will
>> affect the hit count estimation, so we just want to
>> 1. See whether there's any chance we could improve the API or documentation
>> 2. Seek some advice on how we should tackle the problem. Obviously we don't
>> want the rearranger to execute on only 1 thread (since we use it for speed!);
>> currently what we're considering is to relax the check for hit count
>> estimation, but maybe there's a better way?
>>
>> Best
>> Patrick
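
As a rough illustration of the distinction drawn above between exact counts and floors, a relaxed hit count check could compare two results by what each one actually claims instead of by raw value. The Java sketch below is only that, a sketch: `HitCountCheck`, `Hits`, and `Relation` are hypothetical stand-in types that mirror the count-plus-relation shape of Lucene's TotalHits so the example stays self-contained. It is not Lucene code, and the rule it encodes is an assumption about what "relaxing the check" might mean.

```java
// Hypothetical stand-ins mirroring the shape of Lucene's TotalHits
// (a count plus a relation); they exist only to keep this sketch self-contained.
final class HitCountCheck {

    enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

    static final class Hits {
        final long value;
        final Relation relation;

        Hits(long value, Relation relation) {
            this.value = value;
            this.relation = relation;
        }
    }

    /**
     * Returns true if the two reported counts could both describe the same
     * true number of hits, i.e. they do not contradict each other.
     */
    static boolean compatible(Hits a, Hits b) {
        boolean aExact = a.relation == Relation.EQUAL_TO;
        boolean bExact = b.relation == Relation.EQUAL_TO;
        if (aExact && bExact) {
            // Two exact counts must agree, whatever the segment order was.
            return a.value == b.value;
        }
        if (aExact) {
            // An exact count must not fall below the other side's floor.
            return a.value >= b.value;
        }
        if (bExact) {
            return b.value >= a.value;
        }
        // Two floors never contradict each other, even if their values differ.
        return true;
    }
}
```

Under this rule, two indexes that report floors of 1001 and 1003 for the same query pass the check, while two "equal to" counts that differ still fail it.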