[jira] [Issue Comment Edited] (LUCENE-3215) SloppyPhraseScorer sometimes computes Infinite freq

Doron Cohen (JIRA) Sat, 17 Sep 2011 11:57:32 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107209#comment-13107209
 ]


Doron Cohen edited comment on LUCENE-3215 at 9/17/11 6:56 PM:
--------------------------------------------------------------

OK I think I have a fix for this.

While looking at it, I realized that PhraseScorer (which used to be the base of 
both the exact and sloppy phrase scorers, but is now the base of only the 
sloppy phrase scorer) is way too complicated and inefficient. All those sort 
calls after each matching doc can be avoided. 

So I am modifying PhraseScorer to not have a phrase-queue at all - just the 
sorted linked list, which is always kept sorted by advancing last beyond first. 
Last is renamed to 'min' and first is renamed to 'max'. Making the list cyclic 
allows more efficient manipulation of it. 
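
A minimal sketch of the cyclic-list idea, assuming a simplified model in which 
each pp is reduced to a cursor over a sorted array of doc ids. The names 
CyclicPpList, Node and nextDoc are hypothetical and are not the identifiers 
used in the patch. It sorts once at construction; after that, advancing 'min' 
to or beyond 'max' and rotating the two pointers keeps the list sorted.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not the Lucene PhraseScorer code. */
final class CyclicPpList {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Stand-in for a pp: a cursor over one term's sorted doc ids. */
    static final class Node {
        final int[] docs;
        int idx = 0;
        Node next; // cyclic: the largest node points back to the smallest

        Node(int[] docs) { this.docs = docs; }

        int doc() { return idx < docs.length ? docs[idx] : NO_MORE_DOCS; }

        void advance(int target) { while (doc() < target) idx++; }
    }

    private Node min;
    private Node max; // invariant: max.next == min
    private boolean started = false;

    /** Expects at least one posting list. */
    CyclicPpList(int[][] postings) {
        Node[] nodes = new Node[postings.length];
        for (int i = 0; i < nodes.length; i++) nodes[i] = new Node(postings[i]);
        // The only sort: done once, when the list is built.
        Arrays.sort(nodes, (a, b) -> Integer.compare(a.doc(), b.doc()));
        for (int i = 0; i < nodes.length; i++) nodes[i].next = nodes[(i + 1) % nodes.length];
        min = nodes[0];
        max = nodes[nodes.length - 1];
    }

    /** The old min was just advanced to or past max, so it is the new max. */
    private void rotate() { max = min; min = min.next; }

    /** Returns the next doc on which all cursors agree, or NO_MORE_DOCS. */
    int nextDoc() {
        if (started) {
            if (max.doc() == NO_MORE_DOCS) return NO_MORE_DOCS;
            min.advance(max.doc() + 1); // leave the previous match
            rotate();
        }
        started = true;
        while (min.doc() < max.doc()) {
            min.advance(max.doc());
            rotate();
        }
        return max.doc(); // min.doc() == max.doc(): every node is on this doc
    }
}
```

Because max.next is always min, the rotation is two pointer assignments and 
needs no re-sorting or re-linking.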

With this, SloppyPhraseScorer is modified to maintain its own phrase queue. The 
queue size is set at the first candidate document. In order to handle 
repetitions (the same term at different query offsets), the queue will contain 
only some of the pps: those that either have no repetitions, or are the first 
(lowest query offset) in a repeating group. A linked list of repeating pps was 
added, so PhrasePositions has a new member: nextRepeating.

Detection of repeating pps and creation of that list is done once per scorer: 
at the first candidate doc.
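
A rough sketch of how such a one-time detection could look, assuming each pp 
is identified only by its term text and query offset. The class 
RepeatingGroups, the nested PP type and linkRepeats are hypothetical names for 
illustration; only the nextRepeating member is taken from the description 
above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the code in the patch. */
final class RepeatingGroups {

    /** Simplified stand-in for PhrasePositions. */
    static final class PP {
        final String term;
        final int offset;
        PP nextRepeating; // next pp with the same term at a higher offset, or null

        PP(String term, int offset) {
            this.term = term;
            this.offset = offset;
        }
    }

    /**
     * Chains pps that share a term via nextRepeating and returns the pps
     * that would be placed in the queue: those without repetitions, plus
     * the lowest-offset pp of each repeating group.
     */
    static List<PP> linkRepeats(List<PP> pps) {
        List<PP> byOffset = new ArrayList<>(pps);
        byOffset.sort(Comparator.comparingInt((PP p) -> p.offset));

        Map<String, PP> lastSeen = new HashMap<>();
        List<PP> queueMembers = new ArrayList<>();
        for (PP pp : byOffset) {
            PP previous = lastSeen.put(pp.term, pp);
            if (previous == null) {
                queueMembers.add(pp);        // first occurrence of this term
            } else {
                previous.nextRepeating = pp; // extend the repeating chain
            }
        }
        return queueMembers;
    }
}
```

For a query such as a(0) b(1) a(2) a(3), this returns a(0) and b(1), with 
a(0) -> a(2) -> a(3) chained through nextRepeating.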

To solve the bugs reported here, in addition to the initialization of 'end' as 
explained in the previous comment, advanceRepeatingPPs now also updates two 
values:
- end, in case one of the repeating pps is far ahead (larger)
- the position of the first pp in a repeating list (the one that is in the 
queue), in case a repeating pp is far behind (smaller). This can happen when 
there are holes in the query, as position = tpPos - offset. It fixes the 
problem of false negative distances which caused this bug. It is tricky: it 
relies on the fact that PhrasePositions.nextPosition() ignores pp.position and 
just calls positions.nextPosition(). But it is correct, as the modified 
position is used to replace pp in the queue.

Last, I think that the test added with holes had one wrong assert: It added 
four docs:
- drug drug
- drug druggy drug
- drug druggy druggy drug
- drug druggy drug druggy drug

defined this query (number is the offset):
- drug(1) drug(3)

and expected that, with slop=1, the first doc would not be found.
I think it should be found, as the slop operates in both directions.
So I modified the query to: drug(1) drug(3)
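
To make the argument concrete, here is a brute-force model of the four docs 
against drug(1) drug(3). It assumes that the match length is the spread 
between the shifted positions (position = tpPos - offset) of two distinct 
occurrences of the term, and that a doc matches when the smallest such spread 
is <= slop. HoleSlopModel and its methods are hypothetical; the real scorer 
works incrementally and may differ in detail.

```java
/** Hypothetical illustration only; a brute-force model, not the scorer. */
final class HoleSlopModel {

    /**
     * Smallest spread between shifted positions when the two query slots
     * are bound to two distinct occurrences of term in doc.
     * Returns Integer.MAX_VALUE if the term occurs fewer than two times.
     */
    static int minMatchLength(String[] doc, String term, int offsetA, int offsetB) {
        int best = Integer.MAX_VALUE;
        for (int i = 0; i < doc.length; i++) {
            if (!doc[i].equals(term)) continue;
            for (int j = 0; j < doc.length; j++) {
                if (j == i || !doc[j].equals(term)) continue;
                int a = i - offsetA; // position = tpPos - offset
                int b = j - offsetB;
                best = Math.min(best, Math.abs(a - b));
            }
        }
        return best;
    }

    /** True if the doc matches drug(1) drug(3) within the given slop. */
    static boolean matches(String doc, int slop) {
        return minMatchLength(doc.split(" "), "drug", 1, 3) <= slop;
    }
}
```

Under this model the minimal match lengths for the four docs are 1, 0, 1 and 
0, so with slop=1 all four docs match, including "drug drug". In that doc the 
second pp lands at 1 - 3 = -2, behind the first pp at 0 - 1 = -1, which is the 
"repeating pp is far behind" case described above.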

Patch to follow.

  
> SloppyPhraseScorer sometimes computes Infinite freq
> ---------------------------------------------------
>
>                 Key: LUCENE-3215
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3215
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3215.patch, LUCENE-3215_test.patch, 
> LUCENE-3215_test.patch
>
>
> reported on user list:
> http://www.lucidimagination.com/search/document/400cbc528ed63db9/score_of_infinity_on_dismax_query
