[jira] [Comment Edited] (LUCENE-6198) two phase intersection

Adrien Grand (JIRA) Fri, 13 Feb 2015 01:59:01 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319789#comment-14319789
 ]


Adrien Grand edited comment on LUCENE-6198 at 2/13/15 9:57 AM:
---------------------------------------------------------------

I did some more benchmarking and something that helped was to flatten clauses 
in ConjunctionDISI. This typically means that {{+"A B" +C}} is now 
approximated as {{+A +B +C}} instead of {{+(+A +B) +C}} (see the attached 
patch).
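
To illustrate the flattening only, here is a minimal sketch. It is not the code from the attached patch and does not use Lucene's classes: {{FlattenSketch}}, {{Clause}}, {{term}}, {{and}} and {{flatten}} are hypothetical names made up for this example. It turns the nested form {{+(+A +B) +C}} into the flat list {{+A +B +C}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FlattenSketch {
    /** Hypothetical clause: either a single term (leaf) or a conjunction of sub-clauses. */
    static final class Clause {
        final String term;           // non-null for a leaf
        final List<Clause> children; // sub-clauses of a nested conjunction

        private Clause(String term, List<Clause> children) {
            this.term = term;
            this.children = children;
        }

        static Clause term(String text) {
            return new Clause(text, new ArrayList<Clause>());
        }

        static Clause and(Clause... clauses) {
            return new Clause(null, Arrays.asList(clauses));
        }
    }

    /** Returns the leaf terms of a nested conjunction as one flat list, in order. */
    static List<String> flatten(Clause root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Clause clause, List<String> out) {
        if (clause.term != null) {
            out.add(clause.term);
            return;
        }
        for (Clause child : clause.children) {
            collect(child, out); // recurse into nested conjunctions
        }
    }

    public static void main(String[] args) {
        // +(+A +B) +C: the inner conjunction stands for the approximation of "A B".
        Clause nested = Clause.and(
                Clause.and(Clause.term("A"), Clause.term("B")),
                Clause.term("C"));
        System.out.println(flatten(nested));
    }
}
```

Once the clauses sit in one flat list, a conjunction is free to intersect all of them before any per-document position check runs, which is the point of the change.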

Here are results on wikibig:

{noformat}
                    Task   QPS baseline      StdDev   QPS patch      StdDev Pct diff
    AndMedPhraseHighTerm          21.19      (6.1%)       19.98      (2.6%)    -5.7% ( -13% -    3%)
                PKLookup         334.11      (2.1%)      334.82      (2.2%)     0.2% (  -4% -    4%)
   AndHighPhraseHighTerm          11.64      (4.1%)       11.83      (2.4%)     1.6% (  -4% -    8%)
    AndHighPhraseMedTerm          19.19      (2.5%)       21.99      (2.1%)    14.6% (   9% -   19%)
     AndMedPhraseMedTerm          58.27      (6.3%)       67.53      (6.6%)    15.9% (   2% -   30%)
    AndHighPhraseLowTerm          35.07      (5.6%)       42.46      (6.1%)    21.1% (   8% -   34%)
     AndMedPhraseLowTerm          93.39      (8.0%)      128.24     (13.3%)    37.3% (  14% -   63%)
{noformat}

I was curious about the slowdown on AndMedPhraseHighTerm. For instance we have 
{{+"los angeles" +title}}. {{title}} has a high doc frequency and so {{"los 
angeles"}} leads the iteration on trunk, meaning that we check positions on 
38591 documents (the number of matches of {{+los +angeles}}). With the patch, 
we intersect with {{title}} before checking positions, meaning that we only 
check positions on 30711 documents. That does not seem to be low enough 
compared to 38591 to make the query faster.

However, if we take a query from AndMedPhraseLowTerm like {{+"los angeles" 
+rivers}}, this time we only check positions on 1238 documents instead of 
38591, hence the speedup.
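
The effect can be reproduced on toy data. The sketch below is only an illustration under stated assumptions: the postings are invented (they are not taken from wikibig), the position check is simulated by a set lookup, and {{TwoPhaseSketch}}, {{phraseFirst}} and {{termsFirst}} are hypothetical names, not Lucene APIs. Both methods return the same hits and differ only in how many position checks they perform.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class TwoPhaseSketch {
    static boolean contains(int[] sortedDocs, int doc) {
        return Arrays.binarySearch(sortedDocs, doc) >= 0;
    }

    /** Nested form: verify the phrase on every doc of a AND b, then test the extra term. */
    static int[] phraseFirst(int[] a, int[] b, int[] extra, Set<Integer> phraseDocs) {
        int hits = 0;
        int positionChecks = 0;
        for (int doc : a) {
            if (!contains(b, doc)) continue;
            positionChecks++;                          // simulated expensive position check
            if (phraseDocs.contains(doc) && contains(extra, doc)) hits++;
        }
        return new int[] {hits, positionChecks};
    }

    /** Flattened form: intersect a, b and the extra term first, then verify the phrase. */
    static int[] termsFirst(int[] a, int[] b, int[] extra, Set<Integer> phraseDocs) {
        int hits = 0;
        int positionChecks = 0;
        for (int doc : a) {
            if (!contains(b, doc) || !contains(extra, doc)) continue;
            positionChecks++;                          // only survivors reach the check
            if (phraseDocs.contains(doc)) hits++;
        }
        return new int[] {hits, positionChecks};
    }

    public static void main(String[] args) {
        // Invented sorted postings for the two phrase terms and two candidate extra terms.
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] b = {2, 3, 4, 5, 6, 8, 9};
        int[] frequent = {1, 2, 3, 4, 5, 6, 7};
        int[] rare = {4, 10};
        // Docs where the two phrase terms are actually adjacent (stand-in for positions).
        Set<Integer> phraseDocs = new HashSet<>(Arrays.asList(2, 4, 5, 8));

        for (int[] extra : new int[][] {frequent, rare}) {
            int[] nested = phraseFirst(a, b, extra, phraseDocs);
            int[] flat = termsFirst(a, b, extra, phraseDocs);
            System.out.println("hits=" + nested[0]
                    + " checks nested=" + nested[1] + " flattened=" + flat[1]);
        }
    }
}
```

With the figures quoted above, the flattened form still checks positions on 30711 of 38591 candidates (about 80%) for {{title}}, but only on 1238 (about 3%) for {{rivers}}.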

Edit: fixed the explanation which was backwards :)


was (Author: jpountz):
I did some more benchmarking and something that helped was to flatten clauses 
in ConjunctionDISI. This typically means that {{+ "A B"  +C}} is now 
approximated as {{+A +B +C}} instead of {{+(+A +B) +C}}. (see attached patch)

Here are results on wikibig:

{noformat}
                    Task   QPS baseline      StdDev   QPS patch      StdDev Pct diff
    AndMedPhraseHighTerm          21.19      (6.1%)       19.98      (2.6%)    -5.7% ( -13% -    3%)
                PKLookup         334.11      (2.1%)      334.82      (2.2%)     0.2% (  -4% -    4%)
   AndHighPhraseHighTerm          11.64      (4.1%)       11.83      (2.4%)     1.6% (  -4% -    8%)
    AndHighPhraseMedTerm          19.19      (2.5%)       21.99      (2.1%)    14.6% (   9% -   19%)
     AndMedPhraseMedTerm          58.27      (6.3%)       67.53      (6.6%)    15.9% (   2% -   30%)
    AndHighPhraseLowTerm          35.07      (5.6%)       42.46      (6.1%)    21.1% (   8% -   34%)
     AndMedPhraseLowTerm          93.39      (8.0%)      128.24     (13.3%)    37.3% (  14% -   63%)
{noformat}

I was curious about the slow down on AndMedPhraseHighTerm. And actually it 
seems to be tied to the fact that terms are not random. For instance one query 
of this task is {{+"los angeles" +title}} which matches 30669 documents. 
However the approximation is {{+los +angeles +title}} and matches 30711 
documents, so approximation in this case only adds overhead.

> two phase intersection
> ----------------------
>
>                 Key: LUCENE-6198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6198
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-6198.patch, LUCENE-6198.patch, LUCENE-6198.patch, 
> LUCENE-6198.patch, LUCENE-6198.patch, phrase_intersections.tasks
>
>
> Currently some scorers have to do a lot of per-document work to determine if 
> a document is a match. The simplest example is a phrase scorer, but there are 
> others (spans, sloppy phrase, geospatial, etc).
> Imagine a conjunction with two MUST clauses, one that is a term that matches 
> all odd documents, another that is a phrase matching all even documents. 
> Today this conjunction will be very expensive, because the zig-zag 
> intersection is reading a ton of useless positions.
> The same problem happens with filteredQuery and anything else that acts like 
> a conjunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
