[ https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762656#comment-16762656 ]
Mike Sokolov commented on LUCENE-8681: -------------------------------------- bq. However I wonder if this could be implemented directly in a custom CollectorManager, currently the CollectorManager creates a Collector for each leaf if the executor is not null and merge all the Yes, that's totally do-able, but I think doing this here has value since it is a good default for anyone performing multithreaded collection with approximation enabled, and doesn't really impact single-threaded collection in a significant way. > Prorated early termination > -------------------------- > > Key: LUCENE-8681 > URL: https://issues.apache.org/jira/browse/LUCENE-8681 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search > Reporter: Mike Sokolov > Priority: Major > > In this issue we'll exploit the distribution of top K documents among > segments to extract performance gains when using early termination. The basic > idea is we do not need to collect K documents from every segment and then > merge. Rather we can collect a number of documents that is proportional to > the segment's size plus an error bound derived from the combinatorics seen as > a (multinomial) probability distribution. > https://github.com/apache/lucene-solr/pull/564 has the proposed change. > [~rcmuir] pointed out on the mailing list that this patch confounds two > settings: (1) whether to collect all hits, ensuring correct hit counts, and > (2) whether to guarantee that the top K hits are precisely the top K. > The current patch treats this as the same thing. It takes the position that > if the user says it's OK to have approximate counts, then it's also OK to > introduce some small chance of ranking error; occasionally some of the top K > we return may draw from the top K + epsilon. > Instead we could provide some additional knobs to the user. Currently the > public API is {{TopFieldCOllector.create(Sort, int, FieldDoc, int > threshold)}}. The threshold parameter controls when to apply early > termination; it allows the collector to terminate once the given number of > documents have been collected. > Instead of using the same threshold to control leaf-level early termination, > we could provide an additional leaf-level parameter. For example, this could > be a scale factor on the error bound, eg a number of standard deviations to > apply. The patch uses 3, but a much more conservative bound would be 4 or > even 5. With these values, some speedup would still result, but with a much > lower level of ranking errors. A value of MAX_INT would ensure no leaf-level > termination would ever occur. > We could also hide the precise numerical bound and offer users a three-way > enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to > apply this optimization, using some predetermined error bound. > I posted the patch without any user-level tuning since I think the user has > already indicated a preference for speed over precision by specifying a > finite (global) threshold, but if we want to provide finer control, these two > options seem to make the most sense to me. Providing access to the number of > standard deviation to allow from the expected distribution gives the user the > finest control, but it could be hard to explain its proper use. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org