[
https://issues.apache.org/jira/browse/LUCENE-6276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963379#comment-14963379
]
Adrien Grand commented on LUCENE-6276:
--------------------------------------
bq. It will be difficult for many of the 2-phase implementations to calculate a
matchCost – particularly the ones not based on the number of term positions.
What to do?
Agreed: we need to come up with a very simple definition of matchCost that could
be applied regardless of how matches() is implemented. I think we have two
options:
- either an estimated running time of matches() in nanoseconds,
- or an average number of operations that need to be performed in matches(),
so that you would add +1 every time you do a comparison, arithmetic operation,
consume a PostingsEnum, etc.
Runtimes in nanoseconds could easily vary depending on hardware, JVM version,
etc. so I think the 2nd option is more practical. For instance:
- for a phrase query, we would return the sum of the average number of
positions per document (which is an estimate of how many times you will call
PostingsEnum.nextPosition()). Maybe we could try to fold in the cost of
balancing the priority queue too.
- for a doc values range query on numbers, the match cost would be 3: one dv
lookup and 2 comparisons
- for a geo distance query that uses SloppyMath.haversin to confirm matches,
we could easily count how many operations are performed by SloppyMath.haversin
This is simplistic but I think it would do the job and keep the implementation
simple. For instance, a doc values range query would always be confirmed before
a geo-distance query.
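To make the operation-count idea concrete, here is a rough sketch, not the patch itself. All names (MatchCostSketch, Candidate, phraseMatchCost, cheapestFirst) are hypothetical. It assumes the average number of positions per document can be approximated as totalTermFreq / docFreq for each term, and the geo-distance cost in main() is a made-up placeholder rather than a real count of the operations in SloppyMath.haversin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MatchCostSketch {

    /** Hypothetical stand-in for a two-phase iterator; only its cost estimate is modeled. */
    static final class Candidate {
        final String name;
        final float matchCost; // estimated number of simple operations per matches() call

        Candidate(String name, float matchCost) {
            this.name = name;
            this.matchCost = matchCost;
        }
    }

    /**
     * Phrase estimate: sum over the terms of the average number of positions
     * per document. Assumption: that average is totalTermFreq / docFreq.
     */
    static float phraseMatchCost(long[] totalTermFreqs, long[] docFreqs) {
        float cost = 0f;
        for (int i = 0; i < totalTermFreqs.length; i++) {
            // Guard against a zero docFreq so the sketch never divides by zero.
            cost += (float) totalTermFreqs[i] / Math.max(1L, docFreqs[i]);
        }
        return cost;
    }

    /** Doc values range estimate: one dv lookup plus two comparisons. */
    static float docValuesRangeMatchCost() {
        return 3f;
    }

    /** Returns a copy ordered so that the cheapest confirmation runs first. */
    static List<Candidate> cheapestFirst(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> Float.compare(a.matchCost, b.matchCost));
        return sorted;
    }

    public static void main(String[] args) {
        float phrase = phraseMatchCost(new long[] {300, 120}, new long[] {100, 60});
        float geo = 40f; // placeholder only, not a measured operation count
        List<Candidate> ordered = cheapestFirst(Arrays.asList(
            new Candidate("geoDistance", geo),
            new Candidate("phrase", phrase),
            new Candidate("dvRange", docValuesRangeMatchCost())));
        for (Candidate c : ordered) {
            System.out.println(c.name + " cost=" + c.matchCost);
        }
    }
}
```

With these numbers the doc values range check sorts first and the geo-distance check last, which is the ordering described above.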
bq. But I see that the latest BooleanQuery.Builder is not stable due to use of
HashSet / MultiSet versus LinkedHashSet which would be stable. What do you
think Adrien Grand?
Actually it is: those sets and multisets are only used for equals/hashcode. The
creation of scorers is still based on the list of clauses, which maintains the
order from the builder.
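As an illustration of why this is stable, here is a simplified sketch. It is not the BooleanQuery code: ClauseOrderSketch and its methods are hypothetical, and a plain HashSet stands in for the sets and multisets mentioned above (so, unlike a multiset, it ignores duplicate clauses). The point is only that hashing affects equality, while iteration for building scorers reads the insertion-ordered list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class ClauseOrderSketch {
    // Clauses in the order they were added; this drives scorer creation.
    private final List<String> clauses = new ArrayList<>();

    ClauseOrderSketch add(String clause) {
        clauses.add(clause);
        return this;
    }

    /** Order used to create scorers: always the insertion order of the list. */
    List<String> scorerOrder() {
        return new ArrayList<>(clauses);
    }

    /** Equality ignores clause order by comparing unordered sets. */
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ClauseOrderSketch)) {
            return false;
        }
        ClauseOrderSketch other = (ClauseOrderSketch) o;
        return new HashSet<>(clauses).equals(new HashSet<>(other.clauses));
    }

    @Override
    public int hashCode() {
        return new HashSet<>(clauses).hashCode();
    }
}
```

Two instances built with the same clauses in different orders compare equal, yet each keeps its own deterministic scorerOrder().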
bq. Showing the matchCost in explain will be tricky because it is computed by
LeafReaderContext, i.e. by segment.
+1 to not do it
bq. The matchCost is not yet used for the second phase in disjunctions. Yet
another priority queue might be needed for that, so I'd prefer to delay that to
another issue.
Feel free to delay; I plan to explore this in LUCENE-6815.
> Add matchCost() api to TwoPhaseDocIdSetIterator
> -----------------------------------------------
>
> Key: LUCENE-6276
> URL: https://issues.apache.org/jira/browse/LUCENE-6276
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Robert Muir
> Attachments: LUCENE-6276-ExactPhraseOnly.patch,
> LUCENE-6276-NoSpans.patch, LUCENE-6276-NoSpans2.patch, LUCENE-6276.patch,
> LUCENE-6276.patch, LUCENE-6276.patch, LUCENE-6276.patch
>
>
> We could add a method like TwoPhaseDISI.matchCost() defined as something like
> estimate of nanoseconds or similar.
> ConjunctionScorer could use this method to sort its 'twoPhaseIterators' array
> so that cheaper ones are called first. Today it has no idea if one scorer is
> a simple phrase scorer on a short field vs another that might do some geo
> calculation or more expensive stuff.
> PhraseScorers could implement this based on index statistics (e.g.
> totalTermFreq/maxDoc)