[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15466888#comment-15466888 ]
Christoph Goller edited comment on LUCENE-7398 at 9/6/16 3:37 PM: ------------------------------------------------------------------ After thoroughly reviewing the current implementations of SpanNearQuery, PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I volunteer to fix at least some of these problems, but first I would like to have a consensus about the desired bahavior of SpanQuery. This ticket may not be the right place for such a discussion, so please point me to a better place if there is one. 1) Missing Matches caused by lazy iteration: I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I know there never was an implementation that compared all possible combinations of subspan matches for SpanNearQuery in Lucene. So SpanNearQuery always missed some matches. *) This ticket demonstrates missing matches for ordered SpanQuery. Documents that should match don't match. This is caused by subspans of SpanNearQuery having a variable match length. For these cases the lazy iteration implementation which tries to optimize the number of comparisons of subspan matches is not sufficient. *) Tim tried these examples with unorderd SpanQuery and got the same bahavior. I think this is caused by a similar kind of lazy iteration in the unordered case. *) In the unordered case lazy iteration also causes problems if the subspans do not have variable-length matches. This is demonstrated in LUCENE-5331 and LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all clauses of the SpanNearQuery were SpanTermQueries, but some occured more than once. For PhraseQuery and MultiPhraseQuery and their implementation in SloppyPhraseScore this seems to be a known problem that has been solved by a special complex treatment of repetitions that I currently don't understand in detail. My current opinion: We should give up lazy iteration for the unordered and the ordered case to solve these problems. I think it can be done and the performance peanalty should not be too big. We already iterate over all positions of all subspans. So we already have done the expensive operation of reading them. Should some more comparisons of int-values (positions) really matter so much? At least for the ordered case I am optimistic that I could implement it efficiently. 2) Inconsistent Scoring of SpanNearQuery *) Lazy iteration means that some "redundant" matches in a document are skipped in order to have a faster matching algorithm. I am not sure how redundant was defined exactly for the idea of lazy iteration. It referred to matches with the same start posisiton somehow. As long as different matches for the first clause are concerned, they are found, but not the all matches for intermediate subclauses are regarded. Skipping matches however reduces the frequency that is computed and consequently the score. See Javadoc of phraseFreq() in SloppyPhraseScore which mention the same phenomenon. This is quite important for my use case of SpanQueries. I have different versions/variants of the same term on the same position, e.g. one with case-normalization and one without and I want a higher score if the user-query matches for more than one variant, and I use this approach for clauses of SpanNearQuery. *) In NearSpansOrdered the method width() (it is used to compute sloppy frequency in SpanScore) returns the number of gaps between the matches. If you have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it returns the length of the match, not the number of gaps. See atMatch() for the difference. The reason is probably, that (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength) might even become negative if matches overlap. I would prefer something like Math.max(0, (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength)) *) SpanOrQuery and SpanNearQuery completely ignore the scores of their subclauses (subweights are always generated as non-scoring). A SpanOrQuery should give a Score similar to a BooleanQuery, shouldn't it? As long as we have this behavior, SpanBoostQuery does not make any sense, doese it? So to my opinion the existance of SpanBoostQuery shows that others also had the idea that a nested SpanQuery should somehow use the scores of their clauses for the computation of their own score. was (Author: gol...@detego-software.de): After thoroughly reviewing the current implementations of SpanNearQuery, PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I volunteer to fix at least some of these problems, but first I would like to have a consensus about the desired bahavior of SpanQuery. This ticket may not be the right place for such a discussion, so please point me to a better place if there is one. 1) Missing Matches caused by lazy iteration: I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I know there never was an implementation that compared all possible combinations of subspan matches for SpanNearQuery in Lucene. So SpanNEarQuery always missed some matches. *) This ticket demonstrates missing matches for ordered SpanQuery. Documents that should match aer not matching. They are caused by subspans of SpanNearQuery having a variable match length. For these cases the lazy iteration implementation which tries to optimize the number of comparisons of subspan matches is not sufficient. *) Tim tried these examples with unorderd SpanQuery and got the same bahavior. I think this is caused by a similar kind of lazy iteration in the unordered case. *) In the unorderd case lazy iteration also causes problems if the subspans do not have variable-length matches. This is demonstrated in LUCENE-5331 and LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all clauses of the SpanNearQuery were SpanTermQueries, but some occured more than once. For PhraseQuery and MultiPhraseQuery and their implementation in SloppyPhraseScore this seems to be a known problem that has been solved by a special complex treatment of repetitions that I currently don't understand in detail. My current opinion: We should give up lazy iteration for the unorderd and the ordered case to solve these problems. I think it can be done and the performance peanalty should not be too big. We already iterate over all positions of all subspans. So we already have done the expensive operation of reading them. Should some more comparisons of int-values (positions) really matter so much? At least fo the ordered case I am optimistic that I could implement it efficiently. 2) Inconsistent Scoring of SpanNearQuery *) Lazy iteration means that some "redundant" matches in a document are skipped in order to have a faster matching algorithm. I am not sure how redundant was defined exactly for the idea of lazy iteration. It referred to matches with the same start posisiton somehow. As long as different matches for the first clause are concerned, they are found, but not the all matches for intermediate subclauses are regarded. Skipping matches however reduces the frequency that is computed and consequently the score. See Javadoc of phraseFreq() in SloppyPhraseScore which mention the same phenomenon. This is quite important for my use case of SpanQueries. I have different versions/variants of the same term on the same position, e.g. one with case-normalization and one without and I want a higher score if the user-query matches for more than one variant, and I use this approach for clauses of SpanNearQuery. *) In NearSpansOrdered the method width() (it is used to compute sloppy frequency in SpanScore) returns the number of gaps between the matches. If you have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it returns the length of the match, not the number of gaps. See atMatch() for the difference. The reason is probably, that (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength) might even become negative if matches overlap. I would prefer something like Math.max(0, (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength)) *) SpanOrQuery and SpanNearQuery completely ignore the scores of their subclauses (subweights are always generated as non-scoring). A SpanOrQuery should give a Score similar to a BooleanQuery, shouldn't it? As long as we have this behavior, SpanBoostQuery does not make any sense, doese it? So to my opinion the existance of SpanBoostQuery shows that others also had the idea that a nested SpanQuery should somehow use the scores of their clauses for the computation of their own score. > Nested Span Queries are buggy > ----------------------------- > > Key: LUCENE-7398 > URL: https://issues.apache.org/jira/browse/LUCENE-7398 > Project: Lucene - Core > Issue Type: Bug > Components: core/search > Affects Versions: 5.5, 6.x > Reporter: Christoph Goller > Assignee: Alan Woodward > Priority: Critical > Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, > LUCENE-7398.patch, TestSpanCollection.java > > > Example for a nested SpanQuery that is not working: > Document: Human Genome Organization , HUGO , is trying to coordinate gene > mapping research worldwide. > Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], > 0, true), body:gene]), body:research], 0, true) > The query should match "coordinate gene mapping research" as well as > "coordinate gene research". It does not match "coordinate gene mapping > research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It > probably stopped working with the changes on SpanQueries in 5.3. I will > attach a unit test that shows the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org