mikemccand commented on issue #823: LUCENE-8939: Introduce Shared Count Early Termination In Parallel Search URL: https://github.com/apache/lucene-solr/pull/823#issuecomment-527503405 > > > https://issues.apache.org/jira/browse/LUCENE-8681 > > > > > > I am not sure I see how that solves the problem? > > The core change targeted in this PR is allowing `TOTAL_HITS_THRESHOLD` to be accurately counted across all slices. Today, we collect `TOTAL_HITS_THRESHOLD` per slice, which is not what the API definition is. Post this PR, we will collect `TOTAL_HITS_THRESHOLD` in aggregate. `TOTAL_HITS_THRESHOLD`' s definition does not guarantee any order of collection of hits in the concurrent case -- we inadvertently define one today by collection the threshold number of hits per slice. > > RE: Proration, I believe that is a custom logic that can be added on top of this change. In any case, the proration logic also works on a bunch of static values + fudge factors, so it can go wrong and we might end up collecting lesser hits from a more valuable segment. To help prevent this scenario, I believe proration might also do well to build upon this PR and use the shared counter. But, I am unable to see why proration and accurate counting across slices are mutually exclusive. > > In any case, unlike proration, this PR does not propose any algorithmic changes to the way collection is done -- it simply reduces extra work done across slices that we do not even advertise today, so might be something that the user is unaware of. > > To summarize my monologue, this PR is aimed at accurate counting of hits across all slices -- whereas proration targets a different use case of trying to "distributing" hits across slices based on some parameters. Thanks @atris, I was worried this change would alter the correctness of the top hits in the concurrent case based on thread scheduling, but I was wrong: In the sorted index case, we only early terminate a searcher thread (slice) once its thread-private PQ is full and the global (1000 default) hit count has been collected, so that way we know the competitive top hits from that segment will be merged/reduced in the end. In the unsorted index case, where we skip by impacts once we collect more than the 1000 by default, we are also still correct because we continue collecting in that slice, just skipping by impact based on that thread's private PQ bottom. We can make further improvements e.g. to share the global PQ bottom across all searcher threads, but that should come later. So net/net I think the change is correct, and should be a big performance gain for concurrent searching. Sorry for the confusion ;)
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org