[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

Da Huang (JIRA) Sat, 21 Jun 2014 03:03:07 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039754#comment-14039754
 ]


Da Huang edited comment on LUCENE-4396 at 6/21/14 10:00 AM:
------------------------------------------------------------

{quote}
 Looks like you separated required & optional
scores in the non-DAAT impls and then carefully cast to float at the
right times?
{quote}
Yes, you get what I mean.
{quote}
you can remove that TODO in ConjunctionScorer on
switching sum to double?
{quote}
OK, I will do that in the next patch.
{quote}
So BooleanScorerIO is just like BooleanNovelScorer, except it uses a
bitset instead of linked list to track the set buckets? Between BNS
and BSIO which one is faster?
{quote}
Yes, exactly. According to the perf tests, it seems that
BNS does better on the tasks that are faster than trunk,
while BSIO does better on the tasks that are slower than trunk.
{quote}
Why does BSIO/NS see massive gains on the tasks that have so many NOT
clauses? I think in trunk/4.x today, we are not scoring the NOT
clauses, right? While these gains are sizable, I think it's not a
common use case...
{quote}
The reason is that when we search for "+a -b -c -d",
Lucene actually executes "+a -(b c d)", and the cost of computing the
disjunction of (b c d) is huge.
Indeed, this may not be a common case.
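To make the cost concrete, here is a minimal sketch of evaluating "+a -(b c d)" over plain sorted doc ID arrays. It is not Lucene code: the class name, the method name, and the heap-based merge are assumptions chosen for illustration, and it ignores scoring and skipping. It only shows that every required doc is checked against a merged stream of all prohibited postings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch, not Lucene code: evaluates "+a -(b c d)" over sorted doc ID arrays. */
final class ProhibitedUnionSketch {

    /** Returns the docs of {@code required} that appear in none of the {@code prohibited} lists. */
    static int[] search(int[] required, int[][] prohibited) {
        // Heap entries are {currentDoc, listIndex, position}: one cursor per prohibited list.
        PriorityQueue<int[]> heap = new PriorityQueue<>((x, y) -> Integer.compare(x[0], y[0]));
        for (int i = 0; i < prohibited.length; i++) {
            if (prohibited[i].length > 0) {
                heap.add(new int[] {prohibited[i][0], i, 0});
            }
        }
        List<Integer> hits = new ArrayList<>();
        for (int doc : required) {
            // Advance the disjunction until its smallest doc is >= the required doc.
            // Every prohibited posting passes through the heap, which is where the cost goes.
            while (!heap.isEmpty() && heap.peek()[0] < doc) {
                int[] top = heap.poll();
                int next = top[2] + 1;
                if (next < prohibited[top[1]].length) {
                    heap.add(new int[] {prohibited[top[1]][next], top[1], next});
                }
            }
            // Keep the doc only if no prohibited list is positioned on it.
            if (heap.isEmpty() || heap.peek()[0] != doc) {
                hits.add(doc);
            }
        }
        int[] out = new int[hits.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = hits.get(i);
        }
        return out;
    }
}
```

For example, with required docs {1, 5, 9, 20, 25} and prohibited lists {2, 5}, {9, 14} and {3, 20, 30}, search returns [1, 25].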
{quote}
I think you've explored a number of options here and now we need to
see if we can make this committable, e.g. figure out how to have
BooleanQuery pick the right scorer for the situation? Somehow we need
logic that looks at how many / cost of the sub-clauses and picks the
right scorer?
{quote}
Yeah, you're right. 

Besides, a new idea has occurred to me. For BNS, we do not actually
make use of the hash feature of BucketTable. Thus, I think we should not
treat BucketTable as a hash table (i.e. we should not place each doc at the
absolute position buckets[doc & MASK]).
First, we collect 2K required docs into BucketTable. Then we do TAAT on those
2K docs.
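
A rough sketch of how I read this idea, under stated assumptions: it is not the patch's code, the class and method names are hypothetical, the window size of 2048 stands in for "2K", and the constant 1.0 is only a placeholder for the required clause's score. Required docs are packed into consecutive slots instead of buckets[doc & MASK], and each optional term is then processed over the whole window before the next term starts.

```java
import java.util.Arrays;

/** Hypothetical sketch, not the patch's code: required docs packed densely, then scored TAAT. */
final class DenseWindowSketch {
    static final int WINDOW = 2048; // "2K" from the comment; the exact size is an assumption

    /**
     * Scores up to WINDOW required docs starting at index {@code from}.
     * All doc ID arrays must be sorted ascending. Returns one score per collected doc.
     */
    static double[] scoreWindow(int[] requiredDocs, int from,
                                int[][] optionalDocs, double[] optionalWeights) {
        int count = Math.min(WINDOW, requiredDocs.length - from);
        // Dense placement: slot i holds the i-th required doc, not buckets[doc & MASK].
        int[] docs = Arrays.copyOfRange(requiredDocs, from, from + count);
        double[] scores = new double[count];
        Arrays.fill(scores, 1.0); // placeholder score for matching the required clause

        // Term-at-a-time: finish one optional term over the whole window before the next.
        for (int t = 0; t < optionalDocs.length; t++) {
            int[] postings = optionalDocs[t];
            int p = 0;
            for (int i = 0; i < count && p < postings.length; i++) {
                // Move the optional cursor up to the current required doc.
                while (p < postings.length && postings[p] < docs[i]) {
                    p++;
                }
                if (p < postings.length && postings[p] == docs[i]) {
                    scores[i] += optionalWeights[t];
                }
            }
        }
        return scores;
    }
}
```

Because every slot in the window is occupied, no linked list or bitset is needed to find the set buckets; whether this is actually faster would still have to be measured.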


> BooleanScorer should sometimes be used for MUST clauses
> -------------------------------------------------------
>
>                 Key: LUCENE-4396
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4396
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: And.tasks, AndOr.tasks, AndOr.tasks, LUCENE-4396.patch, 
> LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, 
> LUCENE-4396.patch, LUCENE-4396.patch, luceneutil-score-equal.patch, 
> luceneutil-score-equal.patch
>
>
> Today we only use BooleanScorer if the query consists of SHOULD and MUST_NOT.
> If there is one or more MUST clauses we always use BooleanScorer2.
> But I suspect that unless the MUST clauses have very low hit count compared 
> to the other clauses, that BooleanScorer would perform better than 
> BooleanScorer2.  BooleanScorer still has some vestiges from when it used to 
> handle MUST so it shouldn't be hard to bring back this capability ... I think 
> the challenging part might be the heuristics on when to use which (likely we 
> would have to use firstDocID as proxy for total hit count).
> Likely we should also have BooleanScorer sometimes use .advance() on the subs 
> in this case, eg if suddenly the MUST clause skips 1000000 docs then you want 
> to .advance() all the SHOULD clauses.
> I won't have near term time to work on this so feel free to take it if you 
> are inspired!



--
This message was sent by Atlassian JIRA
(v6.2#6252)
