[jira] [Commented] (LUCENE-9204) Move span queries to the queries module

Michael Gibney (Jira) Mon, 21 Jun 2021 10:14:07 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366743#comment-17366743
 ]


Michael Gibney commented on LUCENE-9204:
----------------------------------------

Thanks! Right, in this initial case, comparison of "baseline" and "candidate" 
is not significant. The way I'm thinking about this, there are three high-level 
ways in which these new tasks are useful:
# Compare the 3 different implementations of inner-disjunction proximity 
queries with each other: spans, intervals, and "linear minimized intervals".
# Compare the performance of each of the 3 implementations as the number of 
inner disjunctions (and thus, complexity of possible matches) increases.
# Establish a baseline for performance improvements (and to guard against 
performance regressions) in each of the three query implementations.

In the first of these, "spans" and "linear minimized intervals" (per the Vigna 
paper, iiuc) are most similar in their approach. Practically speaking, the 
difference between "linear"-scaling (with {{rewrite=false}}) and 
"non-linear"-scaling (the default) intervals could inform decisions about the 
performance trade-offs for users choosing between the {{rewrite=false}} and 
{{rewrite=true}} approaches to intervals disjunctions.

I don't think the takeaway wrt "intervals" vs. "spans" performance is clear-cut.

I was surprised to find that spans appear to perform substantially better than 
either intervals implementation for the first ("realistic") use-case. Assuming 
this isn't an accident of the way testing is set up (I went out of my way to 
try to be fair), I'd guess this might be a consequence of different handling of 
high/low/mid-frequency terms?

I was equally surprised to find that intervals appear to perform substantially 
better than spans in the last ("straight-disjunction") case. To clarify, I 
mentioned that "the performance gap ... could probably be closed without too 
much work" not because I'm actually proposing such work, but to point out that 
what each query is doing in the straight-disjunction case is fundamentally the 
same. So I'd caution against drawing too many conclusions from that result wrt 
_inherent_ performance characteristics of intervals vs. spans. Also 
worth noting: I had to [manually 
unwrap|https://github.com/mikemccand/luceneutil/pull/133/files#diff-809d06d69ce19243f246a4b6070b2c263edff403c10f92f4019db7ecef55c9dfR466-R471]
 the intervals queries in order to expose the performance gap; without this 
manual rewriting, spans vs. intervals performance for the straight-disjunction 
case was +/- identical -- ~40 QPS, iirc.

Both intervals and spans could benefit from quantifying these performance 
discrepancies across different cases: doing so points out cases in each that 
might be ripe targets for optimization (unwrapping single-clause intervals in 
the straight-disjunction case could be one such opportunity).

Another main takeaway, from my perspective, was confirmation of the exponential 
performance cost of {{pullUpDisjunctions()}} as the number of inner 
disjunctions increases. Granted, this may just confirm what was already assumed; 
and at the moment, as Jim points out above, the question of correctness takes 
priority; but it's good to quantify the performance impact, and any future 
attempts to address the challenges of "graph" matching will benefit from having 
some existing benchmarks in place.
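
As a rough illustration of why the cost grows this way, here is a minimal 
sketch in plain Java. It does not use any Lucene code. It assumes that pulling 
up disjunctions amounts to distributing each inner disjunction over the 
enclosing ordered sequence, so that every combination of alternatives becomes 
its own top-level clause. The class and method names ({{DisjunctionExpansion}}, 
{{expandedClauseCount}}, {{pullUp}}) are hypothetical and exist only for this 
example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Lucene's implementation of pullUpDisjunctions().
// Assumption: each position in a proximity query holds a list of alternative
// terms, and pulling up means enumerating every combination of alternatives.
final class DisjunctionExpansion {

    // Number of top-level clauses after expansion: the product of the
    // number of alternatives at each position.
    static long expandedClauseCount(int[] alternativesPerPosition) {
        long count = 1;
        for (int n : alternativesPerPosition) {
            count = Math.multiplyExact(count, (long) n); // throws on overflow
        }
        return count;
    }

    // Enumerates the expanded sequences explicitly (Cartesian product),
    // preserving the order of positions and of alternatives.
    static List<List<String>> pullUp(List<List<String>> positions) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>()); // start from one empty sequence
        for (List<String> alternatives : positions) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : result) {
                for (String alt : alternatives) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(alt);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }
}
```

Under this model, three positions with two alternatives each expand to 8 
sequences, and every additional two-way inner disjunction doubles that count.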

> Move span queries to the queries module
> ---------------------------------------
>
>                 Key: LUCENE-9204
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9204
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: main (9.0)
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> We have a slightly odd situation currently, with two parallel query 
> structures for building complex positional queries: the long-standing span 
> queries, in core; and interval queries, in the queries module.  Given that 
> interval queries solve at least some of the problems we've had with Spans, I 
> think we should be pushing users more towards these implementations.  It's 
> counter-intuitive to do that when Spans are in core though.  I've opened this 
> issue to discuss moving the spans package as a whole to the queries module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
