[jira] [Commented] (LUCENE-9204) Move span queries to the queries module

Michael Gibney (Jira) Thu, 17 Jun 2021 12:53:05 -0700


    [ https://issues.apache.org/jira/browse/LUCENE-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365106#comment-17365106 ]


Michael Gibney commented on LUCENE-9204:
----------------------------------------

I hope it's ok to post this here; I've [added 
benchmarks|https://github.com/mikemccand/luceneutil/pull/133] with the goal of 
quantifying performance for these different approaches. The index is 500k docs 
from wikimedium. Baseline and candidate code are the same, since I'm initially 
seeking to compare different queries, not different code; the Pct diff and 
p-value columns therefore only reflect run-to-run noise.

First, a realistic use-case, somewhat contrived to exercise 
{{pullUpDisjunctions()}}:
{code}
# (body:us|united-states health|health-care policy|public-policy law|legal-aspects)~10

           Task QPS baseline      StdDev    QPS candidate      StdDev                  Pct diff   p-value
    IntervalDis        20.34     (11.4%)            19.83      (9.2%)     -2.5% ( -20% -   20%)     0.446
 IntervalMinDis        34.03      (9.9%)            35.22      (9.5%)      3.5% ( -14% -   25%)     0.251
        SpanDis        63.63     (10.4%)            68.56     (11.0%)      7.8% ( -12% -   32%)     0.022
{code}
 

Next, an intensive use-case, contrived to illustrate how performance changes 
as the number of internal disjunctions increases:
{code}
# (body:smith a|in-the)~10
# (body:smith a|in-the the|in-the)~10
# (body:smith a|in-the the|in-the a|in-the)~10
# (body:smith a|in-the the|in-the a|in-the the|in-the)~10
# (body:smith a|in-the the|in-the a|in-the the|in-the a|in-the)~10
# (body:smith a|in-the the|in-the a|in-the the|in-the a|in-the the|in-the)~10
# NOTE: "smith" is arbitrary; just to push QPS numbers into a more
# human-friendly range

           Task QPS baseline      StdDev    QPS candidate      StdDev                  Pct diff   p-value
   IntervalDis1        82.47      (2.3%)            81.27      (1.9%)     -1.5% (  -5% -    2%)     0.276
   IntervalDis2        25.96      (1.3%)            25.91      (1.7%)     -0.2% (  -3% -    2%)     0.851
   IntervalDis3         9.46      (2.3%)             9.46      (3.4%)     -0.0% (  -5% -    5%)     0.986
   IntervalDis4         3.69      (2.1%)             3.69      (2.3%)      0.1% (  -4% -    4%)     0.962
   IntervalDis5         1.57      (1.1%)             1.56      (0.9%)     -0.7% (  -2% -    1%)     0.282
   IntervalDis6         0.66      (0.6%)             0.66      (1.5%)     -0.6% (  -2% -    1%)     0.414
IntervalMinDis1       130.06      (5.6%)           129.07      (4.8%)     -0.8% ( -10% -   10%)     0.817
IntervalMinDis2       115.44      (6.3%)           116.59      (4.2%)      1.0% (  -8% -   12%)     0.769
IntervalMinDis3        97.24      (5.0%)            99.19      (7.6%)      2.0% ( -10% -   15%)     0.625
IntervalMinDis4       100.28      (8.0%)           101.31      (3.1%)      1.0% (  -9% -   13%)     0.791
IntervalMinDis5       102.01      (8.0%)           101.34      (6.2%)     -0.6% ( -13% -   14%)     0.886
IntervalMinDis6        99.96      (2.2%)            97.27      (7.0%)     -2.7% ( -11% -    6%)     0.410
       SpanDis1        81.13      (4.0%)            80.34      (2.1%)     -1.0% (  -6% -    5%)     0.630
       SpanDis2        45.01      (1.6%)            44.21      (1.5%)     -1.8% (  -4% -    1%)     0.068
       SpanDis3        31.01      (2.0%)            31.21      (1.9%)      0.6% (  -3% -    4%)     0.608
       SpanDis4        24.36      (2.2%)            23.01      (5.7%)     -5.6% ( -13% -    2%)     0.042
       SpanDis5        19.76      (4.0%)            20.22      (3.5%)      2.3% (  -4% -   10%)     0.324
       SpanDis6        17.29      (4.5%)            16.74      (5.9%)     -3.2% ( -12% -    7%)     0.340
{code}
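
To make the scaling easier to read off the table above, here is a small sketch (not part of the benchmark PR) that computes how much QPS drops with each added disjunction. The numbers are the baseline QPS values copied from the table; the {{slowdown_factors}} helper and the dict layout are names chosen only for this illustration.

```python
# Baseline QPS copied from the table above, ordered by the number of
# internal disjunctions in the query (1 through 6).
qps = {
    "IntervalDis":    [82.47, 25.96, 9.46, 3.69, 1.57, 0.66],
    "IntervalMinDis": [130.06, 115.44, 97.24, 100.28, 102.01, 99.96],
    "SpanDis":        [81.13, 45.01, 31.01, 24.36, 19.76, 17.29],
}


def slowdown_factors(values):
    """Return QPS(n) / QPS(n+1) for each consecutive pair of tasks."""
    return [prev / cur for prev, cur in zip(values, values[1:])]


factors = {task: slowdown_factors(values) for task, values in qps.items()}

for task, per_step in factors.items():
    total = qps[task][0] / qps[task][-1]
    steps = ", ".join(f"{f:.2f}x" for f in per_step)
    print(f"{task:>14}: per-step {steps}; total {total:.1f}x")
```

On these numbers, each added disjunction costs IntervalDis roughly 2.3x to 3.2x in QPS, the SpanDis factor shrinks from about 1.8x toward 1.1x, and IntervalMinDis moves by less than 20% per step.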
 

For good measure, I added two tasks that compare non-positional disjunctions 
across different implementations: SpanOrQuery and DisjunctionIntervalsSource. 
(fwiw, I'd guess the performance gap between straight disjunctions could 
probably be closed without too much work?)
{code}
#  (body:trash|waste|garbage|recycling|refuse)

             Task QPS baseline      StdDev       QPS candidate      StdDev                 Pct diff     p-value
     PlainSpanDis        80.92     (11.3%)               82.80     (17.5%)     2.3% ( -23% -   35%)       0.619
 PlainIntervalDis       142.66     (10.8%)              154.38     (13.6%)     8.2% ( -14% -   36%)       0.035
{code}
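
As a rough way to size the gap between the two plain disjunction tasks, the sketch below averages the baseline and candidate runs (both ran the same code) and takes the ratio. The figures are copied from the table above; the helper names are made up for this example, and the {{pct_diff}} formula is an assumption that happens to reproduce the Pct diff column here.

```python
# (baseline QPS, candidate QPS) copied from the table above.
runs = {
    "PlainSpanDis": (80.92, 82.80),
    "PlainIntervalDis": (142.66, 154.38),
}


def mean_qps(pair):
    """Average the two runs; baseline and candidate ran identical code."""
    return sum(pair) / len(pair)


def pct_diff(pair):
    """Relative change from baseline to candidate, in percent (assumed formula)."""
    baseline, candidate = pair
    return 100.0 * (candidate - baseline) / baseline


# How many times faster the interval disjunction ran than the span one.
gap = mean_qps(runs["PlainIntervalDis"]) / mean_qps(runs["PlainSpanDis"])

for task, pair in runs.items():
    print(f"{task:>16}: mean {mean_qps(pair):.2f} QPS, run-to-run {pct_diff(pair):+.1f}%")
print(f"PlainIntervalDis / PlainSpanDis = {gap:.2f}x")
```

Since the two columns come from identical code, the per-task percentage is a measure of noise, and the ratio between tasks is the number to compare against it.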
 

> Move span queries to the queries module
> ---------------------------------------
>
>                 Key: LUCENE-9204
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9204
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: main (9.0)
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> We have a slightly odd situation currently, with two parallel query 
> structures for building complex positional queries: the long-standing span 
> queries, in core; and interval queries, in the queries module.  Given that 
> interval queries solve at least some of the problems we've had with Spans, I 
> think we should be pushing users more towards these implementations.  It's 
> counter-intuitive to do that when Spans are in core though.  I've opened this 
> issue to discuss moving the spans package as a whole to the queries module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
