[jira] [Comment Edited] (LUCENE-8681) Prorated early termination

Mike Sokolov (JIRA) Thu, 07 Feb 2019 05:29:16 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762656#comment-16762656
 ]


Mike Sokolov edited comment on LUCENE-8681 at 2/7/19 1:28 PM:
--------------------------------------------------------------

bq. However I wonder if this could be implemented directly in a custom 
CollectorManager, currently the CollectorManager creates a Collector for each 
leaf if the executor is not null and merge all the 

Yes, that's totally do-able, but I think doing this here has value since it is 
a good default for anyone performing multithreaded collection with 
approximation enabled, and doesn't really impact single-threaded collection in 
a significant way.

Just to be a little more concrete about the scales here. With a "safety margin" 
of 3 std deviations from the mean, you can expect to see ranking errors (eg 
some result whose true rank is > N in the returned top N) in 1/740 queries (see 
https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule and note that 
this is a 1-sided distribution -- it's OK if a segment has *fewer* top N 
results than we expect).  With 5 standard deviations, the frequency frpos to 
around 1 in 3 million. 


was (Author: sokolov):
bq. However I wonder if this could be implemented directly in a custom 
CollectorManager, currently the CollectorManager creates a Collector for each 
leaf if the executor is not null and merge all the 

Yes, that's totally do-able, but I think doing this here has value since it is 
a good default for anyone performing multithreaded collection with 
approximation enabled, and doesn't really impact single-threaded collection in 
a significant way.

> Prorated early termination
> --------------------------
>
>                 Key: LUCENE-8681
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8681
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Mike Sokolov
>            Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among 
> segments to extract performance gains when using early termination. The basic 
> idea is we do not need to collect K documents from every segment and then 
> merge. Rather we can collect a number of documents that is proportional to 
> the segment's size plus an error bound derived from the combinatorics seen as 
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two 
> settings: (1) whether to collect all hits, ensuring correct hit counts, and 
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that 
> if the user says it's OK to have approximate counts, then it's also OK to 
> introduce some small chance of ranking error; occasionally some of the top K 
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the 
> public API is {{TopFieldCOllector.create(Sort, int, FieldDoc, int 
> threshold)}}. The threshold parameter controls when to apply early 
> termination; it allows the collector to terminate once the given number of 
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination, 
> we could provide an additional leaf-level parameter. For example, this could 
> be a scale factor on the error bound, eg a number of standard deviations to 
> apply. The patch uses 3, but a much more conservative bound would be 4 or 
> even 5. With these values, some speedup would still result, but with a much 
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level 
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way 
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to 
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has 
> already indicated a preference for speed over precision by specifying a 
> finite (global) threshold, but if we want to provide finer control, these two 
> options seem to make the most sense to me. Providing access to the number of 
> standard deviation to allow from the expected distribution gives the user the 
> finest control, but it could be hard to explain its proper use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (LUCENE-8681) Prorated early termination

Reply via email to