[jira] [Commented] (LUCENE-6276) Add matchCost() api to TwoPhaseDocIdSetIterator

Adrien Grand (JIRA) Mon, 19 Oct 2015 07:26:25 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963379#comment-14963379
 ]


Adrien Grand commented on LUCENE-6276:
--------------------------------------

bq. It will be difficult for many of the 2-phase implementations to calculate a 
matchCost – particularly the ones not based on the number of term positions. 
What to do?

Agreed: we need to come up with a very simple definition of matchCost that can 
be applied regardless of how matches() is implemented. I think we have two 
options:
 - either an estimated running time of matches() in nanoseconds,
 - or an average number of operations that need to be performed in matches(), 
so that you would add +1 every time you do a comparison, arithmetic operation, 
consume a PostingsEnum, etc.

Runtimes in nanoseconds could easily vary depending on hardware, JVM version, 
etc. so I think the 2nd option is more practical. For instance:
 - for a phrase query, we would return the sum, over the terms of the phrase, 
of the average number of positions per document (which is an estimate of how 
many times you will call PostingsEnum.nextPosition()). Maybe we could try to 
fold in the cost of balancing the priority queue too.
 - for a doc values range query on numbers, the match cost would be 3: one dv 
lookup and 2 comparisons
 - for a geo distance query that uses SloppyMath.haversin to confirm matches, 
we could easily count how many operations are performed by SloppyMath.haversin

This is simplistic but I think it would do the job and keep the implementation 
simple. For instance, a doc values range query would always be confirmed before 
a geo-distance query.
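
To make the operation-counting idea concrete, here is a minimal sketch. It is 
not Lucene code: Approximation, phraseMatchCost, docValuesRangeMatchCost and 
cheapestFirst are hypothetical names, the phrase estimate uses 
totalTermFreq / docCount as suggested in the issue description, and the 
geo-distance cost is a placeholder rather than a real count of the operations 
in SloppyMath.haversin.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MatchCostSketch {

    /** Hypothetical stand-in for a two-phase iterator; this is not Lucene's class. */
    static final class Approximation {
        final String name;
        final float matchCost; // estimated number of simple operations per matches() call

        Approximation(String name, float matchCost) {
            this.name = name;
            this.matchCost = matchCost;
        }
    }

    /** Phrase query: sum over terms of the average number of positions per document. */
    static float phraseMatchCost(long[] totalTermFreqs, long docCount) {
        float cost = 0f;
        for (long totalTermFreq : totalTermFreqs) {
            cost += (float) totalTermFreq / docCount;
        }
        return cost;
    }

    /** Doc values range query on numbers: one doc values lookup plus two comparisons. */
    static float docValuesRangeMatchCost() {
        return 1 + 2;
    }

    /** Returns a copy ordered so that the cheapest confirmation runs first. */
    static List<Approximation> cheapestFirst(List<Approximation> in) {
        List<Approximation> sorted = new ArrayList<>(in);
        sorted.sort(Comparator.comparingDouble(a -> a.matchCost));
        return sorted;
    }

    public static void main(String[] args) {
        List<Approximation> clauses = new ArrayList<>();
        // 40 is a placeholder, not a measured operation count for SloppyMath.haversin.
        clauses.add(new Approximation("geo-distance", 40f));
        clauses.add(new Approximation("phrase", phraseMatchCost(new long[] {500, 300}, 100)));
        clauses.add(new Approximation("dv-range", docValuesRangeMatchCost()));
        for (Approximation a : cheapestFirst(clauses)) {
            System.out.println(a.name + " matchCost=" + a.matchCost);
        }
    }
}
```

With any geo cost above 3, the doc values range clause sorts ahead of the 
geo-distance clause, which is the ordering described above.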

bq. But I see that the latest BooleanQuery.Builder is not stable due to use of 
HashSet / MultiSet versus LinkedHashSet which would be stable. What do you 
think Adrien Grand? 

Actually it is stable: those sets and multisets are only used for 
equals/hashcode. The creation of scorers is still based on the list of clauses, 
which maintains the order from the builder.
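
A simplified illustration of that separation, assuming a hypothetical 
SketchQuery class with string clauses (this is not the BooleanQuery source): 
the multiset only backs equals/hashCode, while iteration goes through the 
list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClauseOrderSketch {

    /** Hypothetical, simplified query; clauses are plain strings here. */
    static final class SketchQuery {
        private final List<String> clauses;          // keeps the builder's order
        private final Map<String, Integer> multiset; // order-insensitive, equals/hashCode only

        SketchQuery(List<String> clausesInBuilderOrder) {
            this.clauses = Collections.unmodifiableList(new ArrayList<>(clausesInBuilderOrder));
            Map<String, Integer> counts = new HashMap<>();
            for (String clause : clauses) {
                counts.merge(clause, 1, Integer::sum);
            }
            this.multiset = counts;
        }

        /** Scorer creation would walk this list, so its order is deterministic. */
        List<String> clausesInOrder() {
            return clauses;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof SketchQuery && multiset.equals(((SketchQuery) o).multiset);
        }

        @Override
        public int hashCode() {
            return multiset.hashCode();
        }
    }

    public static void main(String[] args) {
        SketchQuery ab = new SketchQuery(Arrays.asList("a", "b"));
        SketchQuery ba = new SketchQuery(Arrays.asList("b", "a"));
        System.out.println("equal regardless of order: " + ab.equals(ba));
        System.out.println("ab iterates as " + ab.clausesInOrder());
        System.out.println("ba iterates as " + ba.clausesInOrder());
    }
}
```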

bq. Showing the matchCost in explain will be tricky because it is computed by 
LeafReaderContext, i.e. by segment.

+1 to not do it

bq. The matchCost is not yet used for the second phase in disjunctions. Yet 
another priority queue might be needed for that, so I'd prefer to delay that to 
another issue.

Feel free to delay, I plan to explore this in LUCENE-6815.

> Add matchCost() api to TwoPhaseDocIdSetIterator
> -----------------------------------------------
>
>                 Key: LUCENE-6276
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6276
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-6276-ExactPhraseOnly.patch, 
> LUCENE-6276-NoSpans.patch, LUCENE-6276-NoSpans2.patch, LUCENE-6276.patch, 
> LUCENE-6276.patch, LUCENE-6276.patch, LUCENE-6276.patch
>
>
> We could add a method like TwoPhaseDISI.matchCost() defined as something like 
> estimate of nanoseconds or similar. 
> ConjunctionScorer could use this method to sort its 'twoPhaseIterators' array 
> so that cheaper ones are called first. Today it has no idea if one scorer is 
> a simple phrase scorer on a short field vs another that might do some geo 
> calculation or more expensive stuff.
> PhraseScorers could implement this based on index statistics (e.g. 
> totalTermFreq/maxDoc)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
