[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

Doron Cohen (Commented) (JIRA) Sat, 03 Mar 2012 15:46:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221737#comment-13221737
 ]


Doron Cohen commented on LUCENE-3821:
-------------------------------------

I understand the problem. 

It all has to do - as Robert mentioned - with the repeating terms in the phrase 
query. I am working on a patch - it will change the way that repeats are 
handled. 

Repeating PPs require additional computation. The current SloppyPhraseScorer 
attempts to do that additional work efficiently, but it oversimplifies and 
fails to cover all cases. 

At the core, each time a repeating PP is selected (from the queue) and 
propagated, *all* its sibling repeaters are propagated as well, to prevent 
a case where two repeating PPs point to the same document position (which was 
the bug that originally triggered handling repeats in this code). 

But this is wrong, because it propagates all sibling repeaters and so misses 
some cases.
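
To illustrate the original problem that the sibling propagation guards against 
(two repeating PPs pointing to the same document position), here is a minimal 
sketch. It is not the SloppyPhraseScorer code: the class 
RepeatCollisionSketch, the method naiveMatch and the flag 
forbidSamePosition are hypothetical names, and the matching rule (phrase 
position = document position minus the term's offset in the query, match if 
the spread is within the slop) is a simplification assumed for this example. 
The query is a two-term phrase with the same term in both slots, and the 
document contains that term only once.

```java
public class RepeatCollisionSketch {

    /**
     * Naive sloppy check for a two-term phrase whose terms sit at query
     * offsets 0 and 1. A pair of document positions matches when the two
     * "phrase positions" (document position minus query offset) differ by
     * at most slop. This is an assumed simplification, not Lucene's code.
     */
    public static boolean naiveMatch(int[] pos0, int[] pos1, int slop,
                                     boolean forbidSamePosition) {
        for (int p0 : pos0) {
            for (int p1 : pos1) {
                // Optionally refuse to let both query terms use one document position.
                if (forbidSamePosition && p0 == p1) {
                    continue;
                }
                int phrasePos0 = p0 - 0; // term at query offset 0
                int phrasePos1 = p1 - 1; // term at query offset 1
                if (Math.abs(phrasePos0 - phrasePos1) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Document "a b": the term 'a' occurs only at position 0.
        int[] a = {0};
        // Query "a a"~1: both slots read the same position list.
        System.out.println(naiveMatch(a, a, 1, false)); // prints true (false positive)
        System.out.println(naiveMatch(a, a, 1, true));  // prints false
    }
}
```

Without the same-position check, the single occurrence of the term satisfies 
both slots of the phrase, so the document matches although the term appears 
only once.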

Also, the code is hard to read, as Mike noted in LUCENE-2410 ([this 
comment|https://issues.apache.org/jira/browse/LUCENE-2410?focusedCommentId=12879443&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12879443]).

So this is a chance to also make the code more maintainable.

I have a working version that is not ready to commit yet. All the tests 
pass except for one, which I think is a bug in ExactPhraseScorer, but maybe I 
am missing something. 

The case that fails is this:

{noformat}
AssertionError: Missing in super-set: doc 706
q1: field:"(j o s) (i b j) (t d)"
q2: field:"(j o s) (i b j) (t d)"~1
td1: [doc=706 score=7.7783184 shardIndex=-1, doc=175 score=6.222655 shardIndex=-1]
td2: [doc=523 score=5.5001016 shardIndex=-1, doc=957 score=5.5001016 shardIndex=-1, doc=228 score=4.400081 shardIndex=-1, doc=357 score=4.400081 shardIndex=-1, doc=390 score=4.400081 shardIndex=-1, doc=503 score=4.400081 shardIndex=-1, doc=602 score=4.400081 shardIndex=-1, doc=757 score=4.400081 shardIndex=-1, doc=758 score=4.400081 shardIndex=-1]
doc 706: Document<stored,indexed,tokenized<field:s o b h j t j z o>>
{noformat}

It seems that q1 should not match this document either?
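
The question can be checked by hand with a brute-force scan that is 
independent of both scorers. The sketch below assumes that each parenthesized 
group in q1 is a set of alternative terms for one phrase position, and that 
the tokens of doc 706 occupy consecutive positions with no gaps. The class 
ExactPhraseCheck and the method firstExactMatch are hypothetical names 
introduced for this example and are not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExactPhraseCheck {

    /**
     * Returns the first start position at which every phrase slot is
     * satisfied by the token at the corresponding consecutive document
     * position, or -1 if there is no such position.
     */
    public static int firstExactMatch(List<String> doc, List<Set<String>> slots) {
        for (int start = 0; start + slots.size() <= doc.size(); start++) {
            boolean allSlotsMatch = true;
            for (int i = 0; i < slots.size(); i++) {
                if (!slots.get(i).contains(doc.get(start + i))) {
                    allSlotsMatch = false;
                    break;
                }
            }
            if (allSlotsMatch) {
                return start;
            }
        }
        return -1;
    }

    private static Set<String> terms(String... t) {
        return new HashSet<>(Arrays.asList(t));
    }

    public static void main(String[] args) {
        // Tokens of doc 706 as shown in the failure output.
        List<String> doc706 = Arrays.asList("s o b h j t j z o".split(" "));
        // q1: field:"(j o s) (i b j) (t d)" with no slop.
        List<Set<String>> q1 = Arrays.asList(
                terms("j", "o", "s"),
                terms("i", "b", "j"),
                terms("t", "d"));
        System.out.println(firstExactMatch(doc706, q1)); // prints -1
    }
}
```

Under these assumptions no start position satisfies all three slots: the 
closest candidate is "o b" at positions 1 and 2, but position 3 holds "h", 
which is in neither "t" nor "d".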
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop it's not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this 
> case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make 
> it fail on trunk or 3.x
