[jira] [Commented] (LUCENE-4872) BooleanWeight should decide how to execute minNrShouldMatch

2013-03-27 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615951#comment-13615951
 ] 

Robert Muir commented on LUCENE-4872:
-

{quote}
I don't really know what the typical/common use cases are for
minShouldMatch.
{quote}

One very practical thing is that Solr query parsers such as dismax/edismax 
(Elasticsearch probably has similar ones too?) actually seem to be fully 
defined in terms of minShouldMatch (with the extremes being handled as OR and 
AND). 
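
A toy illustration of those two extremes, as a minimal Python sketch rather than Lucene or Solr code. The names matches and search and the sample documents are made up for illustration; real dismax/edismax mm handling has more options than this.

```python
def matches(doc_terms, query_terms, min_should_match):
    """True if the document contains at least min_should_match distinct query terms."""
    hits = sum(1 for term in set(query_terms) if term in doc_terms)
    return hits >= min_should_match


# Hypothetical mini-index: document id -> set of terms it contains.
docs = {
    "d1": {"queen", "wembley", "live", "1986"},
    "d2": {"queen", "live"},
    "d3": {"wembley"},
    "d4": {"abba"},
}
query = ["queen", "wembley", "live"]


def search(mm):
    """Ids of all documents that satisfy the constraint, in sorted order."""
    return sorted(doc_id for doc_id, terms in docs.items()
                  if matches(terms, query, mm))


or_like = search(1)            # mm = 1: any single clause is enough (OR)
middle = search(2)             # mm = 2: somewhere in between
and_like = search(len(query))  # mm = number of clauses: all must match (AND)
```

With mm = 1 the query behaves like a plain disjunction, with mm equal to the number of clauses it behaves like a conjunction, and everything else lies in between.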

I know Tom Burton-West has experimented with this some on Chinese TREC data (he 
has some comments on SOLR-3589), etc.


> BooleanWeight should decide how to execute minNrShouldMatch
> ---
>
> Key: LUCENE-4872
> URL: https://issues.apache.org/jira/browse/LUCENE-4872
> Project: Lucene - Core
> Issue Type: Sub-task
> Components: core/search
> Reporter: Robert Muir
> Fix For: 5.0, 4.3
>
> Attachments: crazyMinShouldMatch.tasks
>
>
> LUCENE-4571 adds a dedicated document-at-a-time scorer for minNrShouldMatch 
> which can use advance() behind the scenes. 
> In cases where you have some really common terms and some rare ones, this can 
> be a huge performance improvement.
> On the other hand, BooleanScorer might still be faster in some cases.
> We should think about what the logic should be here: one simple thing to do 
> is to always use the new scorer when minShouldMatch is set; that's where I'm 
> leaning. 
> But maybe we could have a smarter heuristic too, perhaps based on cost().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4872) BooleanWeight should decide how to execute minNrShouldMatch

2013-03-27 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615006#comment-13615006
 ] 

Eks Dev commented on LUCENE-4872:
-

The same pattern as Simon describes here, just with the terms wrapped in 
fuzzy/prefix queries, often as a dismax query. 

For example:
BQ(boo* OR hoo* OR whatever) with e.g. minShouldMatch = 2  

So the only difference from Simon's case is that the individual boolean clauses 
are often more complicated than a simple TermQuery. 



[jira] [Commented] (LUCENE-4872) BooleanWeight should decide how to execute minNrShouldMatch

2013-03-26 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614514#comment-13614514
 ] 

Simon Willnauer commented on LUCENE-4872:
-

I often use min_should_match in practice. One example is searching titles or 
names, such as POIs or metadata. Let's take YouTube as an example: you often 
get queries like "queen wembley live 1989", which was in fact 1986 (at least 
the one I meant here). A pretty good pattern is to use some metric like "80% 
must match if >= 2 query terms", etc. 
Another good example is shingles: a query like "queen wembley live 1989" 
produces lots of terms, and "wembley live" might be pretty common, so you want 
to make sure that you are not returning stuff from other bands. On the other 
hand, a pure conjunction is not acceptable here either. 
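
A rule like "80% must match if >= 2 query terms" could be sketched roughly as follows. This is plain Python, not Lucene or Solr code. The function name, the integer-percent parameter and the choice to round down are assumptions made for illustration.

```python
def min_should_match(num_terms, percent=80, threshold=2):
    """Number of optional clauses that must match for a query of num_terms terms.

    Below the threshold every term is required. From the threshold on, the
    given percentage is required, rounded down (an assumption) but never
    less than one clause.
    """
    if num_terms < threshold:
        return num_terms
    return max(1, num_terms * percent // 100)


# "queen wembley live 1989" has four terms, so three of them must match and
# a document that lacks the wrong year "1989" can still be returned.
mm_for_example = min_should_match(4)
```

Integer arithmetic is used for the percentage so the result does not depend on floating-point rounding.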

Hope that gives some insight?



[jira] [Commented] (LUCENE-4872) BooleanWeight should decide how to execute minNrShouldMatch

2013-03-26 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614475#comment-13614475
 ] 

Michael McCandless commented on LUCENE-4872:


bq. What about your own great work 
(http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html)
 as a use-case to start with?

Thanks, Stefan :)  That's sort of a specialized (but very useful) use case, I 
think ... and the minShouldMatch is always N-1.

bq. Maybe some consulting committers can also share some insight on how this is 
used in the wild.

+1, that'd be great to know!



[jira] [Commented] (LUCENE-4872) BooleanWeight should decide how to execute minNrShouldMatch

2013-03-26 Thread Stefan Pohl (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614371#comment-13614371
 ] 

Stefan Pohl commented on LUCENE-4872:
-

{quote}
I don't really know what the typical/common use cases are for 
minShouldMatch.
{quote}
What about your own great work 
(http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html)
 as a use-case to start with?
Maybe some consulting committers can also share some insight on how this is 
used in the wild.



[jira] [Commented] (LUCENE-4872) BooleanWeight should decide how to execute minNrShouldMatch

2013-03-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613035#comment-13613035
 ] 

Michael McCandless commented on LUCENE-4872:


I don't really know what the typical/common use cases are for
minShouldMatch.

I agree we should err towards BS2, since it can be insanely faster
while BS1 can only be ~3X faster (on queries that are super-slow to
begin with), in this test anyway.

A more accurate cost model for scorers would be awesome!  This could
be a general framework that we'd be able to use for various forms of
query optimization (which we either don't do today or do with
heuristics), e.g. things like whether to apply a filter (AND) high vs
low, whether to use BS1 or BS2 for pure conjunctions, when to split a
PhraseQuery into conjunction + position checking, flattening of nested
boolean queries, MultiTermQuery rewrite method, etc.  But probably we
should explore this on a new issue.



[jira] [Commented] (LUCENE-4872) BooleanWeight should decide how to execute minNrShouldMatch

2013-03-25 Thread Stefan Pohl (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13612467#comment-13612467
 ] 

Stefan Pohl commented on LUCENE-4872:
-

Thanks, Mike, this behaves as expected. Now we have a sense of what trade-off 
we'd be going for if we agree on the current model. It is still a hard decision 
though, entailing questions like:
- Does it matter that queries that are slow anyway get 2-3 times slower?
- Are those queries representative of what users do?

A few suggestions for a better model that maybe go beyond the scope of this 
ticket:

A very conservative usage rule for MSMSumScorer would be to use it only if the 
constraint is at least one higher than the number of high-freq terms; then it 
will always "kick butt" and we'd get most of the bang from this scorer without 
any slow-downs. But we'd miss out on many cases where it would be faster, and 
those might be the ones that users rely on in practice. It is also not clear 
(to me :-) what 'high-freq' means. If anything, it should be defined relative 
to the highest-freq subclause.
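
As a rough illustration of that conservative rule, here is a minimal Python sketch (not Lucene code). The function name and the high_freq_ratio cut-off are hypothetical. The cut-off in particular is just one possible way to define 'high-freq' relative to the highest-freq subclause.

```python
def use_msm_scorer(clause_costs, mm, high_freq_ratio=0.1):
    """Conservative rule: use the MSM scorer only if mm exceeds the number of
    high-freq clauses by at least one.

    clause_costs: estimated number of matching docs per optional clause.
    high_freq_ratio: hypothetical cut-off; a clause counts as high-freq if
    its cost is at least this fraction of the most expensive clause.
    """
    max_cost = max(clause_costs)
    num_high_freq = sum(1 for cost in clause_costs
                        if cost >= high_freq_ratio * max_cost)
    return mm >= num_high_freq + 1


# Two very common clauses and two rare ones.
costs = [1_000_000, 800_000, 50, 20]
with_mm3 = use_msm_scorer(costs, mm=3)  # every match needs at least one rare clause
with_mm2 = use_msm_scorer(costs, mm=2)  # the two common clauses alone could match
```

When the rule holds, every matching document must contain at least one of the rare clauses, so the expensive clauses only ever need to be advanced to a small number of candidates.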

More generally, it seems to me the problem we're trying to solve here is 
identical to computing a cost. If the cost returned by Scorers correlates with 
execution time, then we could simply call the cost() method on BS and 
MSMSumScorer and use MSMSumScorer if it is significantly below the former 
(assuming there are no side-effects in doing these calls). So we'd defer the 
problem to the individual Scorers, which splits the problem up into smaller 
subproblems and the Scorers know themselves best about their implementation and 
behavior.

To make accurate decisions, we probably have to extend the cost-API to return 
more detailed information to base decision rules on, e.g. upper bound, lower 
bound (to be able to make conservative/speculative decisions) and estimate the 
number of returned docs *and* runtime-correlated cost (in some unit). For 
instance, MSMSumScorer's overall cost depends on both of the latter and can be 
split up into the following 2 stages:

1) Candidate generation = heap-based merge of clause subset, i.e. the same as 
for DisjSumScorer, but on a clause subset:
- time to generate all docs from the subScorers: correlates with the sum over 
  the costs of the #clauses-(mm-1) least-costly subScorers
- # candidates = [max(...), min(sum(...), maxdoc)], where ... can be either an 
  upper bound, a lower bound or an estimate in between of the #candidates 
  returned by the #clauses-(mm-1) subScorers
Even for TermScorer, the definitions of these two measures are not identical, 
due to the min(..., maxdoc).

2) Full scoring of candidates:
- time to advance() and decode postings: (mm-1) * # candidates

The costs would still have to be weighted by the relative overhead of 1) 
heap-merging, 2) advance() + early-stopping; not sure if constants are enough 
here.
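
The two stages could be put into numbers roughly like this. It is a minimal Python sketch under simplifying assumptions, not a proposal for the actual cost API: each clause's cost is used both as its doc count and as its runtime cost, the upper bound on candidates is used for stage 2, and the two weights are hypothetical constants.

```python
def msm_cost_estimate(clause_costs, mm, max_doc,
                      merge_weight=1.0, advance_weight=1.0):
    """Two-stage cost estimate for a minShouldMatch scorer over optional clauses."""
    if not 1 <= mm <= len(clause_costs):
        raise ValueError("mm must be between 1 and the number of clauses")

    # Stage 1: heap-merge the #clauses-(mm-1) least-costly sub-scorers.
    lead = sorted(clause_costs)[:len(clause_costs) - (mm - 1)]
    generation_time = sum(lead)
    candidates_lower = max(lead)                # the union is at least this big
    candidates_upper = min(sum(lead), max_doc)  # and at most this big

    # Stage 2: advance() the remaining mm-1 sub-scorers to each candidate.
    scoring_time = (mm - 1) * candidates_upper

    total = merge_weight * generation_time + advance_weight * scoring_time
    return {
        "candidates": (candidates_lower, candidates_upper),
        "generation_time": generation_time,
        "scoring_time": scoring_time,
        "total": total,
    }


costs = [1_000_000, 800_000, 50, 20]
selective = msm_cost_estimate(costs, mm=3, max_doc=2_000_000)
disjunction_like = msm_cost_estimate(costs, mm=1, max_doc=2_000_000)
```

In this made-up example the estimate for mm=3 is several orders of magnitude below the one for mm=1, because only the two rare clauses take part in candidate generation.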

While the scope of this topic seems large (modelling all scorers), I currently 
don't see a simpler way to make this reliably work for arbitrarily structured 
queries, think of MSM(subtree1, Disj(MSM(Conj(....



[jira] [Commented] (LUCENE-4872) BooleanWeight should decide how to execute minNrShouldMatch

2013-03-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610312#comment-13610312
 ] 

Robert Muir commented on LUCENE-4872:
-

To really do this right, I think we probably also need a better tasks file for 
luceneutil.
