[
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15466888#comment-15466888
]
Christoph Goller commented on LUCENE-7398:
------------------------------------------
After thoroughly reviewing the current implementations of SpanNearQuery,
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I
volunteer to fix at least some of these problems, but first I would like to
have a consensus about the desired bahavior of SpanQuery. This ticket may not
be the right place for such a discussion, so please point me to a better place
if there is one.
1) Missing Matches caused by lazy iteration:
I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I
know there never was an implementation that compared all possible combinations
of subspan matches for SpanNearQuery in Lucene. So SpanNEarQuery always missed
some matches.
*) This ticket demonstrates missing matches for ordered SpanQuery. They are
caused by subspans of SpanNearQuery having a variable match length. For these
cases the lazy iteration implementation which tries to optimize the number of
comparisons of subspan matches is not sufficient.
*) Tim tried these examples with unorderd SpanQuery and got the same bahavior.
I think this is caused by a similar kind of lazy iteration in the unordered
case.
*) In the unorderd case lazy iteration also causes problems if the subspans do
not have variable-length matches. This is demonstrated in LUCENE-5331 and
LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all
clauses of the SpanNearQuery were SpanTermQueries, but some occured more than
once. For PhraseQuery and MultiPhraseQuery and their implementation in
SloppyPhraseScore this seems to be a known problem that has been solved by a
special complex treatment of repetitions that I currently don't understand in
detail.
My current opinion: We should give up lazy iteration for the unorderd and the
ordered case to solve these problems. I think it can be done and the
performance peanalty should not be too big. We already iterate over all
positions of all subspans. So we already have done the expensive operation of
reading them. Should some more comparisons of int-values (positions) really
matter so much? At least fo the ordered case I am optimistic that I could
implement it efficiently.
2) Inconsistent Scoring of SpanNearQuery
*) Lazy iteration means that some "redundant" matches in a document are skipped
in order to have a faster matching algorithm. I am not sure how redundant was
defined exactly for the idea of lazy iteration. It referred to matches with the
same start posisiton somehow. Skpping matches however reduces the frequency
that is computed and consequently the score. See Javadoc of phraseFreq() in
SloppyPhraseScore which mention the same phenomenon. This is quite important
for my use case of SpanQueries. I have different versions/variants of the same
term on the same position, e.g. one with case-normalization and one without and
I want a higher score if the user-query matches for more than one variant, and
I use this approach for clauses of SpanNearQuery.
*) In NearSpansOrdered the method width() (it is used to compute sloppy
frequency in SpanScore) returns the number of gaps between the matches. If you
have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it
returns the length of the match, not the number of gaps. See atMatch() for the
difference. The reason is probably, that (maxEndPositionCell.endPosition() -
minPositionCell().startPosition() - totalSpanLength) might even become negative
if matches overlap. I would prefer something like Math.max(0,
(maxEndPositionCell.endPosition() - minPositionCell().startPosition() -
totalSpanLength))
*) SpanOrQuery and SpanNearQuery completely ignore the scores of their
subclauses (subweights are always generated as non-scoring). A SpanOrQuery
should give a Score similar to a BooleanQuery, shouldn't it? As long as we have
this behavior, SpanBoostQuery does not make any sense, doese it? So to my
opinion the existance of SpanBoostQuery shows that others also had the idea
that a nested SpanQuery should somehow use the scores of their clauses for the
computation of their own score.
> Nested Span Queries are buggy
> -----------------------------
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 5.5, 6.x
> Reporter: Christoph Goller
> Assignee: Alan Woodward
> Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch,
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping],
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as
> "coordinate gene research". It does not match "coordinate gene mapping
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It
> probably stopped working with the changes on SpanQueries in 5.3. I will
> attach a unit test that shows the problem.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]