[
https://issues.apache.org/jira/browse/LUCENE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391723#comment-16391723
]
Jim Ferenczi commented on LUCENE-8196:
--------------------------------------
{quote}
I was a bit annoyed to see the field masking hack, but those interval sources
do not actually need term statistics, which makes the hack less horrible. Could
you still document it so that users are aware it is a hack, and explain in
which circumstances it might be ok?
{quote}
I think that the proposed API should be more restrictive regarding the
targeted field. Could we restrict the IntervalsSource to work on a single
field? Something like:
{code:java}
public abstract class IntervalsSource {
  protected final String field;

  public IntervalsSource(String field) {
    this.field = field;
  }

  public abstract IntervalIterator intervals(LeafReaderContext ctx) throws IOException;
  ...
}
{code}
... and then we can check in each implementation that the sources are all
targeting the same field.
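To illustrate the kind of check meant here, a minimal standalone sketch follows. It does not use the Lucene classes: `FieldCheckSketch`, `Source`, `OrderedSource` and `commonField` are hypothetical names invented for this example, and the real implementations in the patch may look quite different.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the proposed API; none of these types are Lucene classes.
final class FieldCheckSketch {

  /** Minimal model of a source that is bound to exactly one field. */
  static class Source {
    final String field;

    Source(String field) {
      if (field == null) {
        throw new IllegalArgumentException("field must not be null");
      }
      this.field = field;
    }
  }

  /** A combining source that refuses sub-sources targeting different fields. */
  static final class OrderedSource extends Source {
    final List<Source> subSources;

    OrderedSource(Source... subSources) {
      // The combined source inherits the single field shared by all of its children.
      super(commonField(subSources));
      this.subSources = Arrays.asList(subSources);
    }
  }

  /** Returns the field shared by all sources, or throws if they disagree. */
  static String commonField(Source... sources) {
    if (sources.length == 0) {
      throw new IllegalArgumentException("at least one source is required");
    }
    String field = sources[0].field;
    for (Source source : sources) {
      if (!field.equals(source.field)) {
        throw new IllegalArgumentException(
            "all sources must target field [" + field + "] but found [" + source.field + "]");
      }
    }
    return field;
  }

  public static void main(String[] args) {
    Source combined = new OrderedSource(new Source("body"), new Source("body"));
    System.out.println("combined field: " + combined.field);
  }
}
```

With a check like this in the constructor, mixing a "body" source with a "title" source fails when the query is built rather than silently producing intervals across fields.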
I understand that it might be powerful to mix multiple fields in an interval
query, but with the current API that seems to be the norm rather than the
exception. We can add the field masking hack later, but for the first
iteration I think it's better to focus on the main use case for this new query,
which is to provide a way to find the minimum intervals in a single field.
Regarding the score of the intervals, it seems that the patch uses the inverse
length of the interval rather than the slop within the interval, as the sloppy
phrase scorer does. Could we compute the total slop of the current interval (as
the sum of the slops of each interval source that composes this interval) and
use its inverse to score each interval? This would make different interval
queries more comparable in terms of score, since an interval with few terms and
a slop>0 would score lower than one with more terms but no slop.
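A small worked example of the difference, as a sketch only. It assumes every term occupies exactly one position, computes the slop on the flattened interval as its length minus the number of terms (rather than summing per-source slops), and uses 1/(1+slop) so that an exact match does not divide by zero. The class and method names are hypothetical and are not taken from the patch.

```java
// Hypothetical comparison of the two scoring ideas; not code from the patch.
final class IntervalScoreSketch {

  /** Number of positions covered by the interval [start, end], both inclusive. */
  static int length(int start, int end) {
    return end - start + 1;
  }

  /** Positions inside the interval that are not occupied by a matching term. */
  static int slop(int start, int end, int termCount) {
    return length(start, end) - termCount;
  }

  /** Scoring by inverse interval length. */
  static float inverseLengthScore(int start, int end) {
    return 1f / length(start, end);
  }

  /** Scoring by inverse slop; the +1 is an assumption to keep exact matches finite. */
  static float inverseSlopScore(int start, int end, int termCount) {
    return 1f / (1 + slop(start, end, termCount));
  }

  public static void main(String[] args) {
    // Two terms at positions 3 and 5: length 3, slop 1.
    System.out.println("2 terms, slop 1: length-based=" + inverseLengthScore(3, 5)
        + " slop-based=" + inverseSlopScore(3, 5, 2));
    // Four adjacent terms at positions 3..6: length 4, slop 0.
    System.out.println("4 terms, slop 0: length-based=" + inverseLengthScore(3, 6)
        + " slop-based=" + inverseSlopScore(3, 6, 4));
  }
}
```

Under these assumptions the length-based score ranks the sloppy two-term match (1/3) above the exact four-term match (1/4), while the slop-based score ranks the exact match (1) above the sloppy one (1/2).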
I'll look deeper at the implementation of the different queries but I like the
simplicity of the patch and the fact that there is a paper with a proof for
each of them.
> Add IntervalQuery and IntervalsSource to expose minimum interval semantics
> across term fields
> ---------------------------------------------------------------------------------------------
>
> Key: LUCENE-8196
> URL: https://issues.apache.org/jira/browse/LUCENE-8196
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Attachments: LUCENE-8196.patch
>
>
> This ticket proposes an alternative implementation of the SpanQuery family
> that uses minimum-interval semantics from
> [http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf]
> to implement positional queries across term-based fields. Rather than using
> TermQueries to construct the interval operators, as in LUCENE-2878 or the
> current Spans implementation, we instead use a new IntervalsSource object,
> which will produce IntervalIterators over a particular segment and field.
> These are constructed using various static helper methods, and can then be
> passed to a new IntervalQuery which will return documents that contain one or
> more intervals so defined.