[ https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366743#comment-17366743 ]
Michael Gibney commented on LUCENE-9204:
----------------------------------------

Thanks! Right, in this initial case, comparison of "baseline" and "candidate" is not significant. The way I'm thinking about this, there are three high-level ways in which these new tasks are useful:
# Compare 3 different implementations of inner-disjunction proximity queries with each other: spans, intervals, and "linear minimized intervals".
# Compare the performance of each of the 3 implementations as the number of inner disjunctions (and thus, complexity of possible matches) increases.
# Establish a baseline for performance improvements (and to guard against performance regressions) in each of the three query implementations.

In the first of these, "spans" and "linear minimized intervals" (per the Vigna paper, iiuc) are most similar in their approach. Practically speaking, the difference between "linear"-scaling (with {{rewrite=false}}) and "non-linear"-scaling (the default) intervals could inform decisions about performance trade-offs for different use cases, for users considering when to use the {{rewrite=false}} vs. {{rewrite=true}} intervals disjunction approaches.

I don't think the takeaway wrt "intervals" vs. "spans" performance is clear-cut. I was surprised to find that spans appear to perform substantially better than either intervals implementation for the first ("realistic") use-case. Assuming this isn't an accident of the way testing is set up (I went out of my way to try to be fair), I'd guess this might be a consequence of different handling of high/low/mid-frequency terms?

I was equally surprised to find that intervals appear to perform substantially better than spans in the last ("straight-disjunction") case. To clarify, I mentioned that "the performance gap ... could probably be closed without too much work" not because I'm actually proposing such work, but more to point out that what each query is doing in the straight-disjunction case is fundamentally the same, so a word of caution about drawing too many conclusions from that result wrt _inherent_ performance characteristics of intervals vs. spans. Also worth noting: I had to [manually unwrap|https://github.com/mikemccand/luceneutil/pull/133/files#diff-809d06d69ce19243f246a4b6070b2c263edff403c10f92f4019db7ecef55c9dfR466-R471] the intervals queries in order to expose the performance gap; without this manual rewriting, spans vs. intervals performance for the straight-disjunction case was roughly identical (~40 QPS, iirc).

Intervals and spans could both benefit from quantifying these performance discrepancies in different cases, by pointing out cases in each that might be ripe targets for optimization (unwrapping single-clause intervals in the latter case could be one such opportunity).

Another main takeaway from my perspective was to confirm the exponential performance implications of {{pullUpDisjunctions()}} over increasing numbers of inner disjunctions. Granted, this may just confirm what was already assumed, and at the moment, as Jim points out above, the question of correctness takes priority. But it's good to quantify the performance impact, and any future attempts to address the challenges of "graph" matching will benefit from having some existing benchmarks in place.
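
To make the "exponential" point concrete, here is a minimal toy sketch in plain Java. It is not Lucene's implementation: the class {{DisjunctionExpansion}} and its {{pullUp}} method are hypothetical names, and the sketch simply assumes that pulling disjunctions up to the top level amounts to enumerating every way of choosing one alternative per position. Under that assumption, n inner disjunctions of k alternatives each expand to k^n plain sequences.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only (hypothetical names, no Lucene dependency): counts how many
// plain term sequences result when every inner disjunction is expanded.
final class DisjunctionExpansion {

    /** Returns every sequence formed by picking one alternative per position. */
    static List<List<String>> pullUp(List<List<String>> positions) {
        List<List<String>> expanded = new ArrayList<>();
        expanded.add(new ArrayList<>()); // start from a single empty sequence
        for (List<String> alternatives : positions) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : expanded) {
                for (String term : alternatives) {
                    List<String> sequence = new ArrayList<>(prefix);
                    sequence.add(term);
                    next.add(sequence);
                }
            }
            expanded = next; // size is multiplied by alternatives.size()
        }
        return expanded;
    }

    public static void main(String[] args) {
        List<String> alternatives = List.of("a", "b", "c");
        List<List<String>> positions = new ArrayList<>();
        for (int n = 1; n <= 5; n++) {
            positions.add(alternatives); // one more inner disjunction
            System.out.println(n + " inner disjunctions -> "
                + pullUp(positions).size() + " sequences");
        }
    }
}
```

With three alternatives per position, each added inner disjunction triples the number of expanded sequences, which is the growth pattern the benchmarks are meant to quantify.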

> Move span queries to the queries module
> ---------------------------------------
>
>                 Key: LUCENE-9204
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9204
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: main (9.0)
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> We have a slightly odd situation currently, with two parallel query
> structures for building complex positional queries: the long-standing span
> queries, in core; and interval queries, in the queries module. Given that
> interval queries solve at least some of the problems we've had with Spans, I
> think we should be pushing users more towards these implementations. It's
> counter-intuitive to do that when Spans are in core though. I've opened this
> issue to discuss moving the spans package as a whole to the queries module.