[jira] [Commented] (LUCENE-8196) Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields

Martin Hermann (JIRA) Tue, 28 Aug 2018 06:57:30 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595002#comment-16595002
 ]


Martin Hermann commented on LUCENE-8196:
----------------------------------------

First of all, I really like this implementation and the ideas that went into 
it. But as I have spent quite some time with the old span queries and their 
problems, I'd like to comment on some things and maybe offer some fresh view 
points for old problems:
 
 
Obviously, maxwidth is not completely identical to specifying slop: Let's say 
we want to do some sort of synonym expansion and query for "("big bad" OR evil) 
wolf" (this is of course related to the prefix-problem we already know about 
("genome editing"), but I think still slightly different).
With span queries, this would have been possible, as we just have to set slop 
to 0 in all queries, but now we have to do something like
{code:java}
Intervals.maxwidth(3,
        Intervals.ordered(
            Intervals.or(
                Intervals.maxwidth(2,
                    
Intervals.ordered(Intervals.term("big"),Intervals.term("bad"))),
                Intervals.term("evil")),
            Intervals.term("wolf")));{code}
which also matches "evil eyes wolf", which should not be a match. It would be 
possible to rewrite the query so that the disjunction is at the top level, 
something like
{code:java}
Intervals.or(
    Intervals.maxwidth(2,
        Intervals.ordered(Intervals.term("evil"),Intervals.term("wolf"))),
    Intervals.maxwidth(3,
        
Intervals.ordreed(Intervals.term("big"),Intervals.term("bad"),Intervals.term("wolf"))));{code}

which would work as expected, but I think we can agree that this is not really 
a nice solution (but I will come back to it later).

 

Now, we already know that "(big OR "big bad") wolf" would not match "big bad 
wolf" (this is exactly the genome editing thing), but I think it is worth to 
point out exactly why: It actually should not match, according to the 
definition of "minimum interval": Any match for "big bad" is also a match for 
big, so the first IntervalsSource only passes matches for "big", and then we 
get no match for "big wolf". This is a feature of the query semantic of the 
paper (and maybe the reason for the efficency and simplicity of the 
algorithms): The problems that spanQueries had are gone, because we define the 
unexpected behaviour to be correct*. As much as I like the IntervalQueries, I 
do not really think this is satisfactory.

 
There are actually other, similar cases with containing/containedBy: Let's say 
our document is "big bad big wolf" and we want "bad wolf" (slop 1) to be 
contained by "big wolf" (slop 2). We would get no match in this document, as 
the minimal match for the big interval is just "big wolf" (as the other match, 
"big bad big wolf" contains this one). At least to me this is counter intuitive 
and I would expect the document to match.
It really gets strange if we mix in some "OR":
{noformat}
"big wolf" (slop 1) contained in ("big wolf" (slop 1) OR "bad wolf") {noformat}
does not match "big bad wolf", in contrast to
{noformat}
"big wolf" (slop 1) contained in ("big wolf" (slop 1)){noformat}
, which does. So we actually lose a match by adding a OR-clause, and I think we 
can agree that this is not really good. Of course these are not queries a human 
would write, but I think one major use case of span queries is some sort of 
automatic query generation, and that's where I think it is really important to 
meet at least some basic expectations (such as not losing matches by adding 
disjunctions).
 
I don't see a way to fix this that still follows minimal interval semantics, as 
all this is actually how it SHOULD work there, but this would mean we'd lose 
the correctness proofs. The only thing I can think of is some sort of query 
rewriting, pushing the disjunction as far top as neccessary, but this may be 
rather performance heavy and also does not solve the "bad wolf" (slop 1) 
contained by "big wolf" (slop 2) problem.
 
Any thoughts?

*A short theoretical aside: I think that most of the span query problems came 
from the fact that we want to have a "next match" function, i.e. some sort of 
ordering of matches, together with the nature of span query Matches, which are 
essentially a pair of numbers (start and end of match). This means we have to 
specify an order on pairs of numbers (which is possible, of course; the 
solution with span queries was a lexical order, i.e. the start always 
increases, and if it stays the same, the end increases). But I think it is not 
really possible to implement completly lazy behaviour with this ordering: Think 
of some ordered "((a OR b) followed by (c OR d)) with enough slop" and the 
document "a b c d" which should find "a b c d" before "b c" (as the start 
increases), but has to cache the match for "c", which (in the sub-query "(c OR 
d)) occurs before the one for "b". So the combination of ordering and lazy is 
where we get problems. The vigna paper is now quite clever, in that the very 
definition of "minimum interval" solves this problem, as the pair of start and 
end positions gets essentially reduced to a single number (e.g. start 
position), as it is not possible for one of them to increase with the other 
staying the same or even decreasing (as otherwise one of the matches would be 
contained in the other). I actually think that a possible solution would be to 
give up the "lazy" part instead and have some kind of caching, but that is not 
what this ticket is about.

> Add IntervalQuery and IntervalsSource to expose minimum interval semantics 
> across term fields
> ---------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8196
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8196
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: LUCENE-8196-debug.patch, LUCENE-8196.patch, 
> LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket proposes an alternative implementation of the SpanQuery family 
> that uses minimum-interval semantics from 
> [http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf]
>  to implement positional queries across term-based fields.  Rather than using 
> TermQueries to construct the interval operators, as in LUCENE-2878 or the 
> current Spans implementation, we instead use a new IntervalsSource object, 
> which will produce IntervalIterators over a particular segment and field.  
> These are constructed using various static helper methods, and can then be 
> passed to a new IntervalQuery which will return documents that contain one or 
> more intervals so defined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-8196) Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields

Reply via email to