[
https://issues.apache.org/jira/browse/LUCENE-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107209#comment-13107209
]
Doron Cohen edited comment on LUCENE-3215 at 9/17/11 6:56 PM:
--------------------------------------------------------------
OK I think I have a fix for this.
While looking at it, I realized that PhraseScorer (which used to be the base of
both the Exact and Sloppy phrase scorers but is now the base of only the sloppy
phrase scorer) is way too complicated and inefficient. All those sort calls
after each matching doc can be avoided.
So I am modifying PhraseScorer to not have a phrase-queue at all - just the
sorted linked list, which is always kept sorted by advancing last beyond first.
Last is renamed to 'min' and first is renamed to 'max'. Making the list cyclic
allows more efficient manipulation of it.
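To illustrate why a cyclic list makes this cheap, here is a minimal sketch. The
Node and CyclicSortedList classes are hypothetical, simplified stand-ins (they
hold only a doc id) and are not the actual PhraseScorer or PhrasePositions code.
Since max.next is min, advancing min to a doc at or beyond max leaves it already
in sorted position, so no sort call is needed.

```java
// Hypothetical, simplified stand-in for PhrasePositions: only a doc id and a link.
final class Node {
    int doc;
    Node next; // the list is cyclic, so max.next == min
    Node(int doc) { this.doc = doc; }
}

final class CyclicSortedList {
    Node min; // node with the smallest doc
    Node max; // node with the largest doc

    // Builds the cycle from at least one doc id, given in ascending order.
    CyclicSortedList(int... sortedDocs) {
        if (sortedDocs.length == 0) {
            throw new IllegalArgumentException("need at least one doc");
        }
        Node prev = null;
        for (int d : sortedDocs) {
            Node n = new Node(d);
            if (prev == null) {
                min = n;
            } else {
                prev.next = n;
            }
            prev = n;
        }
        max = prev;
        max.next = min; // close the cycle
    }

    // Moves the smallest node to newDoc (at or beyond the current max).
    // Because max.next == min, the moved node is already in its sorted place:
    // it only needs to be relabeled as the new max.
    void advanceMinBeyondMax(int newDoc) {
        if (newDoc < max.doc) {
            throw new IllegalArgumentException("newDoc must not be behind max");
        }
        min.doc = newDoc;
        max = min;
        min = min.next;
    }
}
```

Each advance is two pointer updates, regardless of how many pps are in the list.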
With this, SloppyPhraseScorer is modified to maintain its own phrase queue. The
queue size is set at the first candidate document. In order to handle
repetitions (the same term at different query offsets) it will contain only some
of the pps: those that either have no repetitions, or are the first (lower query
offset) in a repeating group. A linked list of repeating pps was added, so
PhrasePositions has a new member: nextRepeating.
Detection of repeating pps and creation of that list is done once per scorer:
at the first candidate doc.
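A rough sketch of that one-time detection step follows. The PP class here is a
hypothetical simplification (term text plus query offset), and comparing terms
by string equality is an assumption made for illustration; only the
nextRepeating name comes from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for PhrasePositions.
final class PP {
    final String term;
    final int offset;   // offset of the term within the query
    PP nextRepeating;   // next pp with the same term, or null
    PP(String term, int offset) {
        this.term = term;
        this.offset = offset;
    }
}

final class RepeatsSketch {
    // pps must be ordered by ascending query offset.
    // Links pps that share a term through nextRepeating and returns the pps
    // that would go into the queue: those without repetitions, plus the
    // first (lowest offset) pp of each repeating group.
    static List<PP> linkRepeats(List<PP> pps) {
        Map<String, PP> lastSeen = new HashMap<>();
        List<PP> queueMembers = new ArrayList<>();
        for (PP pp : pps) {
            PP prev = lastSeen.put(pp.term, pp);
            if (prev == null) {
                queueMembers.add(pp);    // first occurrence of this term
            } else {
                prev.nextRepeating = pp; // chain onto the repeating group
            }
        }
        return queueMembers;
    }
}
```

For the query drug(1) druggy(2) drug(3), this keeps drug(1) and druggy(2) as
queue members and chains drug(3) behind drug(1).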
To solve the bugs reported here, in addition to the initialization of 'end' as
explained in the previous comment, advanceRepeatingPPs now also updates two
values:
- end, in case one of the repeating pps is far ahead (larger).
- the position of the first pp in a repeating list (the one that is in the
queue), in case a repeating pp is far behind (smaller). This can happen when
there are holes in the query, as position = tpPos - offset. It fixes the
problem of falsely negative distances which caused this bug. It is tricky: it
relies on PhrasePositions.nextPosition() ignoring pp.position and just calling
positions.nextPosition(). But it is correct, as the modified position is used
to replace pp in the queue.
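As a small worked example of how a repeating pp can fall behind, the sketch
below computes the smallest and largest phrase-relative positions of a
repeating group. The class and method names are hypothetical, and 0-based token
positions are assumed; only the formula position = tpPos - offset is taken from
the description above.

```java
final class RepeatingGroupBounds {
    // Phrase-relative position of one pp: position = tpPos - offset.
    static int position(int tpPos, int offset) {
        return tpPos - offset;
    }

    // Returns {smallest, largest} position over the pps of a repeating group.
    // tpPos[i] is the current term position of pp i, offsets[i] its query offset.
    static int[] bounds(int[] tpPos, int[] offsets) {
        int lo = Integer.MAX_VALUE;
        int hi = Integer.MIN_VALUE;
        for (int i = 0; i < tpPos.length; i++) {
            int p = position(tpPos[i], offsets[i]);
            lo = Math.min(lo, p);
            hi = Math.max(hi, p);
        }
        return new int[] {lo, hi};
    }
}
```

For the query drug(1) drug(3) against the doc "drug drug" (term positions 0 and
1), the two pps sit at 0 - 1 = -1 and 1 - 3 = -2. The repeating pp (offset 3)
is therefore behind the queued one, so bounds returns {-2, -1}: the smaller
value is what the queued pp's position would be lowered to, and the larger
value is a candidate for end.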
Last, I think that the test added with holes had one wrong assert: It added
four docs:
- drug drug
- drug druggy drug
- drug druggy druggy drug
- drug druggy drug druggy drug
and defined this query (the number is the query offset):
- drug(1) drug(3)
and expected that with slop=1 the first doc would not be found.
I think it should be found, as the slop operates in both directions.
So I modified the query to: drug(1) drug(3)
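To make the "both directions" argument concrete, here is a simplified model
for a two-term query over the same term. It is not Lucene's sloppy matching
algorithm, only an assumption-laden sketch: tokens are split on single spaces,
positions are 0-based, and the distance is the absolute difference of the two
phrase-relative positions (tpPos - offset), minimized over all pairs of
distinct occurrences.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoTermSlopSketch {
    // Smallest |(q - offset2) - (p - offset1)| over distinct occurrences p, q
    // of term in doc. Returns Integer.MAX_VALUE if term occurs fewer than twice.
    static int minDisplacement(String doc, String term, int offset1, int offset2) {
        String[] tokens = doc.split(" ");
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                hits.add(i);
            }
        }
        int best = Integer.MAX_VALUE;
        for (int p : hits) {
            for (int q : hits) {
                if (p == q) {
                    continue; // the two pps cannot share one occurrence
                }
                int d = Math.abs((q - offset2) - (p - offset1));
                best = Math.min(best, d);
            }
        }
        return best;
    }
}
```

Under this model, with drug(1) drug(3), the four docs give displacements of 1,
0, 1 and 0 respectively. The first doc has its two terms one position closer
than the query asks for, the third has them one position farther apart, and
both are within slop=1.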
Patch to follow.
> SloppyPhraseScorer sometimes computes Infinite freq
> ---------------------------------------------------
>
> Key: LUCENE-3215
> URL: https://issues.apache.org/jira/browse/LUCENE-3215
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Robert Muir
> Assignee: Doron Cohen
> Attachments: LUCENE-3215.patch, LUCENE-3215_test.patch,
> LUCENE-3215_test.patch
>
>
> reported on user list:
> http://www.lucidimagination.com/search/document/400cbc528ed63db9/score_of_infinity_on_dismax_query