[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

Christoph Goller (JIRA) Tue, 06 Sep 2016 08:39:30 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15466888#comment-15466888
 ]


Christoph Goller edited comment on LUCENE-7398 at 9/6/16 3:37 PM:
------------------------------------------------------------------

After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired bahavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so please point me to a better place 
if there is one. 

1) Missing Matches caused by lazy iteration:

I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I 
know there never was an implementation that compared all possible combinations 
of subspan matches for SpanNearQuery in Lucene. So SpanNearQuery always missed 
some matches.

*) This ticket demonstrates missing matches for ordered SpanQuery. Documents 
that should match don't match. This is caused by subspans of SpanNearQuery 
having a variable match length. For these cases the lazy iteration 
implementation which tries to optimize the number of comparisons of subspan 
matches is not sufficient.

*) Tim tried these examples with unorderd SpanQuery and got the same bahavior. 
I think this is caused by a similar kind of lazy iteration in the unordered 
case.

*) In the unordered case lazy iteration also causes problems if the subspans do 
not have variable-length matches. This is demonstrated in LUCENE-5331 and 
LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all 
clauses of the SpanNearQuery were SpanTermQueries, but some occured more than 
once. For PhraseQuery and MultiPhraseQuery and their implementation in 
SloppyPhraseScore this seems to be a known problem that has been solved by a 
special complex treatment of repetitions that I currently don't understand in 
detail.

My current opinion: We should give up lazy iteration for the unordered and the 
ordered case to solve these problems. I think it can be done and the 
performance peanalty should not be too big. We already iterate over all 
positions of all subspans. So we already have done the expensive operation of 
reading them. Should some more comparisons of int-values (positions) really 
matter so much? At least for the ordered case I am optimistic that I could 
implement it efficiently.

2) Inconsistent Scoring of SpanNearQuery

*) Lazy iteration means that some "redundant" matches in a document are skipped 
in order to have a faster matching algorithm. I am not sure how redundant was 
defined exactly for the idea of lazy iteration. It referred to matches with the 
same start posisiton somehow. As long as different matches for the first clause 
are concerned, they are found, but not the all matches for intermediate 
subclauses are regarded. Skipping matches however reduces the frequency that is 
computed and consequently the score. See Javadoc of phraseFreq() in 
SloppyPhraseScore which mention the same phenomenon. This is quite important 
for my use case of SpanQueries. I have different versions/variants of the same 
term on the same position, e.g. one with case-normalization and one without and 
I want a higher score if the user-query matches for more than one variant, and 
I use this approach for clauses of SpanNearQuery.

*) In NearSpansOrdered the method width() (it is used to compute sloppy 
frequency in SpanScore) returns the number of gaps between the matches. If you 
have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it 
returns the length of the match, not the number of gaps. See atMatch() for the 
difference. The reason is probably, that (maxEndPositionCell.endPosition() - 
minPositionCell().startPosition() - totalSpanLength) might even become negative 
if matches overlap. I would prefer something like Math.max(0, 
(maxEndPositionCell.endPosition() - minPositionCell().startPosition() - 
totalSpanLength))

*) SpanOrQuery and SpanNearQuery completely ignore the scores of their 
subclauses  (subweights are always generated as non-scoring). A SpanOrQuery 
should give a Score similar to a BooleanQuery, shouldn't it? As long as we have 
this behavior, SpanBoostQuery does not make any sense, doese it? So to my 
opinion the existance of SpanBoostQuery shows that others also had the idea 
that a nested SpanQuery should somehow use the scores of their clauses for the 
computation of their own score.



was (Author: [email protected]):
After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired bahavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so please point me to a better place 
if there is one. 

1) Missing Matches caused by lazy iteration:

I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I 
know there never was an implementation that compared all possible combinations 
of subspan matches for SpanNearQuery in Lucene. So SpanNEarQuery always missed 
some matches.

*) This ticket demonstrates missing matches for ordered SpanQuery. Documents 
that should match aer not matching. They are caused by subspans of 
SpanNearQuery having a variable match length. For these cases the lazy 
iteration implementation which tries to optimize the number of comparisons of 
subspan matches is not sufficient.

*) Tim tried these examples with unorderd SpanQuery and got the same bahavior. 
I think this is caused by a similar kind of lazy iteration in the unordered 
case.

*) In the unorderd case lazy iteration also causes problems if the subspans do 
not have variable-length matches. This is demonstrated in LUCENE-5331 and 
LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all 
clauses of the SpanNearQuery were SpanTermQueries, but some occured more than 
once. For PhraseQuery and MultiPhraseQuery and their implementation in 
SloppyPhraseScore this seems to be a known problem that has been solved by a 
special complex treatment of repetitions that I currently don't understand in 
detail.

My current opinion: We should give up lazy iteration for the unorderd and the 
ordered case to solve these problems. I think it can be done and the 
performance peanalty should not be too big. We already iterate over all 
positions of all subspans. So we already have done the expensive operation of 
reading them. Should some more comparisons of int-values (positions) really 
matter so much? At least fo the ordered case I am optimistic that I could 
implement it efficiently.

2) Inconsistent Scoring of SpanNearQuery

*) Lazy iteration means that some "redundant" matches in a document are skipped 
in order to have a faster matching algorithm. I am not sure how redundant was 
defined exactly for the idea of lazy iteration. It referred to matches with the 
same start posisiton somehow. As long as different matches for the first clause 
are concerned, they are found, but not the all matches for intermediate 
subclauses are regarded. Skipping matches however reduces the frequency that is 
computed and consequently the score. See Javadoc of phraseFreq() in 
SloppyPhraseScore which mention the same phenomenon. This is quite important 
for my use case of SpanQueries. I have different versions/variants of the same 
term on the same position, e.g. one with case-normalization and one without and 
I want a higher score if the user-query matches for more than one variant, and 
I use this approach for clauses of SpanNearQuery.

*) In NearSpansOrdered the method width() (it is used to compute sloppy 
frequency in SpanScore) returns the number of gaps between the matches. If you 
have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it 
returns the length of the match, not the number of gaps. See atMatch() for the 
difference. The reason is probably, that (maxEndPositionCell.endPosition() - 
minPositionCell().startPosition() - totalSpanLength) might even become negative 
if matches overlap. I would prefer something like Math.max(0, 
(maxEndPositionCell.endPosition() - minPositionCell().startPosition() - 
totalSpanLength))

*) SpanOrQuery and SpanNearQuery completely ignore the scores of their 
subclauses  (subweights are always generated as non-scoring). A SpanOrQuery 
should give a Score similar to a BooleanQuery, shouldn't it? As long as we have 
this behavior, SpanBoostQuery does not make any sense, doese it? So to my 
opinion the existance of SpanBoostQuery shows that others also had the idea 
that a nested SpanQuery should somehow use the scores of their clauses for the 
computation of their own score.


> Nested Span Queries are buggy
> -----------------------------
>
>                 Key: LUCENE-7398
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7398
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 5.5, 6.x
>            Reporter: Christoph Goller
>            Assignee: Alan Woodward
>            Priority: Critical
>         Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

Reply via email to