[
https://issues.apache.org/jira/browse/LUCENE-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769698#comment-16769698
]
Mike Sokolov commented on LUCENE-8681:
--------------------------------------
There are a bunch of different ways to provide for opt-in here. The most
focused would be to just require users to call {{IndexSearcher.search(Query,
CollectorManager)}}. That's currently the only way to invoke concurrent
collection. We could provide a convenient {{CollectorManager}} via a static
method in {{TopFieldCollector}}. I think probably that's enough for this issue?
I thought about how to make this easier for users by pushing up to higher-level
APIs. There's not an obvious right way, but here's my 2c. Following the
current API one would add yet more overrides of {{IndexSearcher.search}} and
{{IndexSearcher.searchAfter}} providing the ability to supply a threshold or
boolean to enable this feature. I see that there is no such convenience
available for {{trackTotalHits}}, and I suspect folks felt there were simply
too many overrides already? It certainly seems that way to me. When I see an
API getting a great many parameters and overloads with different sets of them,
I want to introduce a class to hold them (we don't have optional args and
default args in Java). IndexSearcher could take a SearchConfig object that
would just be a simple struct holding its various options (sort, numHits,
doDocScores, doMaxScore, trackTotalHits, proratedEarlyTerminationThreshold,
etc.). That would make the search()/searchAfter() methods have simpler signatures
(eventually). Having an object class to hold options (like IndexWriterConfig)
gives a nice centralized way to provide documentation. Also, in a search UI one
often varies pagination and sorting via a different code path than the core
Query, so it feels natural to me to use a different abstraction to track those
things.
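
To make the SearchConfig idea concrete, here is a minimal sketch of what such an options holder might look like. Everything in it is hypothetical: the class does not exist in Lucene, the defaults are invented for illustration, and the sort is a plain String standing in for Lucene's Sort so the example compiles on its own. Only the field names come from the list above.

```java
import java.util.Objects;

/** Hypothetical options holder in the spirit of IndexWriterConfig; not a Lucene class. */
final class SearchConfig {
    // Defaults are assumptions made for this sketch only.
    private String sort = "relevance"; // stand-in for org.apache.lucene.search.Sort
    private int numHits = 10;
    private boolean doDocScores = false;
    private boolean doMaxScore = false;
    private boolean trackTotalHits = true;
    // Integer.MAX_VALUE is used here to mean "never terminate early".
    private int proratedEarlyTerminationThreshold = Integer.MAX_VALUE;

    SearchConfig setSort(String sort) {
        this.sort = Objects.requireNonNull(sort, "sort");
        return this;
    }

    SearchConfig setNumHits(int numHits) {
        if (numHits <= 0) {
            throw new IllegalArgumentException("numHits must be positive: " + numHits);
        }
        this.numHits = numHits;
        return this;
    }

    SearchConfig setDoDocScores(boolean doDocScores) {
        this.doDocScores = doDocScores;
        return this;
    }

    SearchConfig setDoMaxScore(boolean doMaxScore) {
        this.doMaxScore = doMaxScore;
        return this;
    }

    SearchConfig setTrackTotalHits(boolean trackTotalHits) {
        this.trackTotalHits = trackTotalHits;
        return this;
    }

    SearchConfig setProratedEarlyTerminationThreshold(int threshold) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must be non-negative: " + threshold);
        }
        this.proratedEarlyTerminationThreshold = threshold;
        return this;
    }

    String getSort() { return sort; }
    int getNumHits() { return numHits; }
    boolean isDoDocScores() { return doDocScores; }
    boolean isDoMaxScore() { return doMaxScore; }
    boolean isTrackTotalHits() { return trackTotalHits; }
    int getProratedEarlyTerminationThreshold() { return proratedEarlyTerminationThreshold; }

    public static void main(String[] args) {
        // A caller only names the options it wants to change.
        SearchConfig config = new SearchConfig()
            .setSort("date")
            .setNumHits(20)
            .setTrackTotalHits(false)
            .setProratedEarlyTerminationThreshold(1000);
        System.out.println(config.getSort() + " / " + config.getNumHits());
    }
}
```

With something like this, a single search(Query, SearchConfig) signature could stand in for many overloads, and each setter gives one obvious place to document an option.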
> Prorated early termination
> --------------------------
>
> Key: LUCENE-8681
> URL: https://issues.apache.org/jira/browse/LUCENE-8681
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: Mike Sokolov
> Priority: Major
>
> In this issue we'll exploit the distribution of top K documents among
> segments to extract performance gains when using early termination. The basic
> idea is we do not need to collect K documents from every segment and then
> merge. Rather we can collect a number of documents that is proportional to
> the segment's size plus an error bound derived from the combinatorics seen as
> a (multinomial) probability distribution.
> https://github.com/apache/lucene-solr/pull/564 has the proposed change.
> [~rcmuir] pointed out on the mailing list that this patch confounds two
> settings: (1) whether to collect all hits, ensuring correct hit counts, and
> (2) whether to guarantee that the top K hits are precisely the top K.
> The current patch treats this as the same thing. It takes the position that
> if the user says it's OK to have approximate counts, then it's also OK to
> introduce some small chance of ranking error; occasionally some of the top K
> we return may draw from the top K + epsilon.
> Instead we could provide some additional knobs to the user. Currently the
> public API is {{TopFieldCollector.create(Sort, int, FieldDoc, int
> threshold)}}. The threshold parameter controls when to apply early
> termination; it allows the collector to terminate once the given number of
> documents have been collected.
> Instead of using the same threshold to control leaf-level early termination,
> we could provide an additional leaf-level parameter. For example, this could
> be a scale factor on the error bound, e.g. a number of standard deviations to
> apply. The patch uses 3, but a much more conservative bound would be 4 or
> even 5. With these values, some speedup would still result, but with a much
> lower level of ranking errors. A value of MAX_INT would ensure no leaf-level
> termination would ever occur.
> We could also hide the precise numerical bound and offer users a three-way
> enum (EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK) that controls whether to
> apply this optimization, using some predetermined error bound.
> I posted the patch without any user-level tuning since I think the user has
> already indicated a preference for speed over precision by specifying a
> finite (global) threshold, but if we want to provide finer control, these two
> options seem to make the most sense to me. Providing access to the number of
> standard deviations to allow from the expected distribution gives the user the
> finest control, but it could be hard to explain its proper use.
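
As a rough illustration of the per-segment bound discussed in the issue description, the sketch below computes how many hits a leaf might collect: its proportional share of K plus a configurable number of standard deviations. It is not the code from the pull request. The class and method names are hypothetical, the standard deviation is assumed to be that of the binomial marginal of the multinomial, sqrt(K * p * (1 - p)), and the mapping of the three-way enum onto a fixed bound of 3 is one possible reading of the proposal.

```java
/** Hypothetical helper illustrating a prorated per-leaf hit bound; not the patch itself. */
final class ProratedLeafBound {

    /** The three-way choice floated in the description. */
    enum Mode { EXACT, APPROXIMATE_COUNT, APPROXIMATE_RANK }

    /**
     * Hits to collect from one leaf: expected share of topK plus numStdDev standard
     * deviations, clamped to the range [1, topK].
     */
    static int perLeafHits(int topK, long leafDocs, long totalDocs, double numStdDev) {
        if (topK <= 0 || leafDocs <= 0 || totalDocs < leafDocs || numStdDev < 0) {
            throw new IllegalArgumentException("invalid arguments");
        }
        double p = (double) leafDocs / totalDocs;          // leaf's share of the index
        double expected = topK * p;                        // mean of the binomial marginal
        double stdDev = Math.sqrt(topK * p * (1.0 - p));   // assumed error model
        double bound = Math.ceil(expected + numStdDev * stdDev);
        // Never collect more than topK from a leaf, and always at least one hit.
        return (int) Math.max(1, Math.min(topK, bound));
    }

    /** Enum variant: only APPROXIMATE_RANK prorates, using a predetermined bound of 3. */
    static int perLeafHits(int topK, long leafDocs, long totalDocs, Mode mode) {
        return mode == Mode.APPROXIMATE_RANK
            ? perLeafHits(topK, leafDocs, totalDocs, 3.0)
            : topK;
    }

    public static void main(String[] args) {
        // A leaf holding 10% of the index, top 100 requested.
        for (double sigmas : new double[] {3, 4, 5, Integer.MAX_VALUE}) {
            System.out.println(sigmas + " -> " + perLeafHits(100, 100, 1000, sigmas));
        }
    }
}
```

Under these assumptions a larger number of standard deviations collects more hits per leaf, and a very large value such as Integer.MAX_VALUE is clamped to K, which amounts to no leaf-level termination.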