[
https://issues.apache.org/jira/browse/LUCENE-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15797937#comment-15797937
]
Michael McCandless commented on LUCENE-7055:
--------------------------------------------
I really like this idea and the latest patch: I think it will be an immense
query-time optimization for some cases, e.g. a restrictive {{TermQuery}}
against a massive {{PointRangeQuery}} where doc values are also indexed for
that range field.
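To make that case concrete, here is a rough sketch in plain Java. It is not Lucene code: {{LeadCostPlanner}}, {{Plan}}, {{choose}}, {{countVerified}} and the simple cost comparison are all made up for illustration, and the real patch may decide differently.

```java
import java.util.function.LongPredicate;

// Hypothetical, simplified model; none of these names are Lucene APIs.
final class LeadCostPlanner {
    enum Plan { BUILD_FULL_ITERATOR, VERIFY_PER_DOC }

    // Assumed rule of thumb: if the lead (restrictive) clause matches fewer
    // docs than the expensive clause is estimated to match, do not build the
    // expensive iterator; check each candidate doc instead.
    static Plan choose(long leadCost, long expensiveCost) {
        return leadCost < expensiveCost ? Plan.VERIFY_PER_DOC : Plan.BUILD_FULL_ITERATOR;
    }

    // VERIFY_PER_DOC path: visit only the lead clause's docs and test each
    // doc's value (docValues[doc], a stand-in for a doc values lookup).
    static int countVerified(int[] leadDocs, long[] docValues, LongPredicate inRange) {
        int count = 0;
        for (int doc : leadDocs) {
            if (inRange.test(docValues[doc])) {
                count++;
            }
        }
        return count;
    }
}
```

With a lead clause that matches only a handful of docs, the per-doc path touches just those docs rather than every point in the range.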
I like how this solution lets us "phase in" queries over time (default impl
for lazyScorer).
For the {{BKDReader}} impls and {{PointValues}} APIs, can we rename
{{estimateCost}} to {{estimatePointCount}} or just {{estimateCount}}? "Cost" is
a bit vague here, yet what we are computing is somewhat tightly defined. I
think {{cost}} is a good name for the {{LazyScorer}} method.
Maybe rename {{LazyScorer}} to {{ScorerSource}}? {{LazyScorer}} makes me feel
like the laziness applies during actual iteration of the hits...
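A tiny sketch of what I mean, in plain Java rather than Lucene types. The {{ScorerSource}} interface below is only my guess at the shape, and {{CountingSource}} and its {{builds}} counter are hypothetical helpers: the point is that the deferred part is construction, while iteration over the returned hits is ordinary.

```java
// Hypothetical stand-in for the patch's LazyScorer (or a renamed ScorerSource);
// this is not the Lucene API.
interface ScorerSource {
    // Cheap estimate of the number of matches; must not build anything.
    long cost();

    // Builds and returns the sorted matching doc ids. The flag is only a
    // hint about how the caller expects to consume them.
    int[] get(boolean randomAccess);
}

final class CountingSource implements ScorerSource {
    private final int[] docs;
    int builds = 0; // how many times the expensive build step actually ran

    CountingSource(int[] docs) {
        this.docs = docs;
    }

    @Override
    public long cost() {
        return docs.length;
    }

    @Override
    public int[] get(boolean randomAccess) {
        builds++; // the expensive work happens here, not in cost()
        return docs.clone();
    }
}
```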
I like the switch to a {{Map<Occur,Collection>}} for boolean scorer's {{subs}}
tracking.
{{FakeLazyScorer}} in {{TestLazyBoolean2Scorer}} seems to fail to initialize
its {{this.randomAccess}} in its 2nd ctor so the assert is never invoked?
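For illustration only, here is one way to avoid that kind of slip. It is a hypothetical reconstruction, not the class from the patch: {{FakeSource}} and its fields are made-up names. Chaining the constructors and declaring the field {{final}} means the compiler rejects any constructor that forgets to assign it.

```java
// Hypothetical reconstruction of the pattern; the real FakeLazyScorer in the
// patch may look different.
final class FakeSource {
    private final long cost;
    // Expected hint, or null when the test does not care. Because the field
    // is final, every constructor must assign it or compilation fails.
    private final Boolean randomAccess;

    FakeSource(long cost) {
        this(cost, null); // delegate so there is a single place that assigns fields
    }

    FakeSource(long cost, Boolean randomAccess) {
        this.cost = cost;
        this.randomAccess = randomAccess;
    }

    long cost() {
        return cost;
    }

    // Stands in for the test's assert; it throws so the check also fires
    // when the JVM runs without -ea.
    void get(boolean actualRandomAccess) {
        if (randomAccess != null && randomAccess != actualRandomAccess) {
            throw new IllegalStateException(
                "expected randomAccess=" + randomAccess + " but got " + actualRandomAccess);
        }
    }
}
```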
If I pass {{randomAccess = false}} to {{LazyScorer.get}} am I not allowed to
invoke {{advance}} on the returned {{Scorer}}? Maybe the javadocs can call
this argument "hint about expected usage"? It's too bad this is not somehow
more strongly typed, like you get back a {{Bits}} (plus some way to score if
it's needed) if you asked for random access, but I don't see how to do that.
Long ago (can't find the issue now) we had an issue exploring something along
these lines. But, let's keep the approach now in your patch: progress not
perfection!
We don't need to implement it now, but I'm curious how we'll implement the cost
method for multi-term queries? It seems like merely computing the cost
(enumerating all terms & summing their {{sumDocFreq}}) would be a big part of
the overall cost of executing such queries. I guess we would also need a
doc-values based query here too, e.g. one that checks the automaton on a binary
doc values field or something?
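To show where the worry comes from, here is a toy sketch in plain Java, assuming a prefix query and a sorted in-memory term dictionary. {{PrefixCost}}, {{estimate}} and the {{TreeMap}} of per-term document frequencies are stand-ins I made up, not Lucene's terms API. The estimate has to visit every matching term, so its work grows with the number of terms the query expands to.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model: a sorted map from term to its document frequency stands in for
// the terms dictionary. Not Lucene code.
final class PrefixCost {
    // Sums the document frequency of every term that starts with the prefix.
    // The result is an upper bound on the number of matching docs, since a
    // doc containing several matching terms is counted once per term.
    static long estimate(TreeMap<String, Integer> docFreqByTerm, String prefix) {
        long sum = 0;
        // Seek to the first term >= prefix, then scan until the prefix no
        // longer matches; one step per matching term.
        for (Map.Entry<String, Integer> e : docFreqByTerm.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            sum += e.getValue();
        }
        return sum;
    }
}
```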
Maybe change {{if (cost < 0)}} to {{if (cost == -1)}} in
{{LazyBoolean2Scorer}} (more explicit)?
> Better execution path for costly queries
> ----------------------------------------
>
> Key: LUCENE-7055
> URL: https://issues.apache.org/jira/browse/LUCENE-7055
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Attachments: LUCENE-7055.patch, LUCENE-7055.patch
>
>
> In Lucene 5.0, we improved the execution path for queries that run costly
> operations on a per-document basis, like phrase queries or doc values
> queries. But we have another class of costly queries, that return fine
> iterators, but these iterators are very expensive to build. This is typically
> the case for queries that leverage DocIdSetBuilder, like TermsQuery,
> multi-term queries or the new point queries. Intersecting such queries with a
> selective query is very inefficient since these queries build a doc id set of
> matching documents for the entire index.
> Is there something we could do to improve the execution path for these
> queries?
> One idea that comes to mind is that most of these queries could also run on
> doc values, so maybe we could come up with something that would help decide
> how to run a query based on other parts of the query? (Just thinking out
> loud, other ideas are very welcome)