[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access

Adrien Grand (JIRA) Mon, 29 Sep 2014 06:29:48 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Adrien Grand updated LUCENE-5938:
---------------------------------
    Attachment: LUCENE-5938.patch

I updated the patch to account for recent trunk modifications and ran the 
benchmark again. I think it is ready. In summary, this patch:
 - introduces a new doc-id set implementation, similar to FixedBitSet, that is 
much faster in the sparse case and somewhat slower in the dense case (between 
1.5x and 4x slower according to benchmarks)
 - introduces a doc-id set builder that supports random write access by 
starting with a sparse bit set and upgrading to a dense FixedBitSet when the 
cardinality of the set increases
 - changes MultiTermQueryWrapperFilter and TermsFilter to use this new builder
 - removes CONSTANT_SCORE_AUTO_REWRITE and makes CONSTANT_SCORE_FILTER_REWRITE 
the default
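
As a rough illustration of the builder idea in the list above, here is a 
minimal sketch. It is not the code in the attached patch. The class name 
UpgradingBitSetBuilder, the 4096-bit block size, the maxDoc / 64 upgrade 
threshold, and the use of java.util.BitSet in place of FixedBitSet are all 
assumptions made for the example.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical sketch: starts with lazily allocated blocks, upgrades to a dense bit set. */
final class UpgradingBitSetBuilder {
    private static final int BLOCK_BITS = 4096;                 // assumed block size
    private static final int WORDS_PER_BLOCK = BLOCK_BITS / 64;

    private final int maxDoc;
    private final int threshold;   // assumed upgrade point: maxDoc / 64 distinct docs
    private long[][] sparse;       // blocks are allocated on first write; null once upgraded
    private BitSet dense;          // stands in for FixedBitSet; null while still sparse
    private int cardinality;       // distinct docs seen during the sparse phase

    UpgradingBitSetBuilder(int maxDoc) {
        if (maxDoc <= 0) throw new IllegalArgumentException("maxDoc must be positive");
        this.maxDoc = maxDoc;
        this.threshold = Math.max(1, maxDoc >>> 6);
        this.sparse = new long[(maxDoc - 1) / BLOCK_BITS + 1][];
    }

    /** Random write access: docs may arrive in any order, duplicates are allowed. */
    void add(int doc) {
        if (doc < 0 || doc >= maxDoc) throw new IndexOutOfBoundsException("doc=" + doc);
        if (dense != null) { dense.set(doc); return; }
        int block = doc / BLOCK_BITS;
        if (sparse[block] == null) sparse[block] = new long[WORDS_PER_BLOCK];
        int word = (doc % BLOCK_BITS) / 64;
        long mask = 1L << doc;     // Java uses only the low 6 bits of the shift count
        if ((sparse[block][word] & mask) == 0) {
            sparse[block][word] |= mask;
            if (++cardinality > threshold) {   // set is no longer sparse: upgrade
                dense = sparseToBitSet();
                sparse = null;
            }
        }
    }

    boolean isDense() { return dense != null; }

    /** Sorted doc IDs. A real implementation would iterate the sparse blocks directly. */
    int[] docs() {
        BitSet bits = dense != null ? dense : sparseToBitSet();
        return bits.stream().toArray();
    }

    private BitSet sparseToBitSet() {
        BitSet bits = new BitSet(maxDoc);
        for (int b = 0; b < sparse.length; b++) {
            long[] words = sparse[b];
            if (words == null) continue;       // untouched block costs nothing
            for (int w = 0; w < words.length; w++) {
                long remaining = words[w];
                while (remaining != 0) {
                    int bit = Long.numberOfTrailingZeros(remaining);
                    bits.set(b * BLOCK_BITS + w * 64 + bit);
                    remaining &= remaining - 1; // clear the lowest set bit
                }
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        UpgradingBitSetBuilder builder = new UpgradingBitSetBuilder(6400);
        builder.add(5000);
        builder.add(7);
        builder.add(7);
        System.out.println(builder.isDense() + " " + Arrays.toString(builder.docs()));
    }
}
```

In this sketch the sparse phase only pays for the blocks that are actually 
touched, and the one-time copy into the dense set happens only once the 
number of distinct documents passes the assumed threshold.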

For queries that match many documents ({{wikimedium10m.tasks}}, where the 
builder always ends up building a FixedBitSet), this new patch can be slower 
or faster depending on how the cost of iterating over the matching terms 
(which are now enumerated only once) compares with the cost of building the 
doc-id set.

For queries that match few documents ({{low_freq.tasks}}, attached to this 
issue), this new patch makes things faster since it just sets a couple of bits 
in random order and then iterates over them instead of merging documents coming 
from several other iterators on the fly using a priority queue.
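
The two strategies compared above can be sketched as follows. This is an 
illustration only, not Lucene code: the class and method names are 
hypothetical, plain sorted int arrays stand in for per-term postings 
iterators, and java.util.BitSet stands in for the new doc-id set.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of two ways to union several sorted doc-id lists. */
final class UnionStrategies {

    /** Merges the lists on the fly; every emitted doc costs heap operations. */
    static int[] mergeWithPriorityQueue(List<int[]> postings) {
        // Each cursor is {listIndex, positionInList}, ordered by its current doc ID.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
                Comparator.comparingInt((int[] c) -> postings.get(c[0])[c[1]]));
        int total = 0;
        for (int i = 0; i < postings.size(); i++) {
            total += postings.get(i).length;
            if (postings.get(i).length > 0) queue.add(new int[] {i, 0});
        }
        int[] out = new int[total];
        int count = 0;
        int last = -1;                          // doc IDs are assumed non-negative
        while (!queue.isEmpty()) {
            int[] cursor = queue.poll();
            int[] list = postings.get(cursor[0]);
            int doc = list[cursor[1]];
            if (doc != last) {                  // skip docs matched by several lists
                out[count++] = doc;
                last = doc;
            }
            if (++cursor[1] < list.length) queue.add(cursor);
        }
        return Arrays.copyOf(out, count);
    }

    /** Sets bits in whatever order the docs arrive, then iterates once in order. */
    static int[] collectIntoBitSet(List<int[]> postings, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int[] list : postings) {
            for (int doc : list) bits.set(doc);
        }
        return bits.stream().toArray();
    }

    public static void main(String[] args) {
        List<int[]> postings = Arrays.asList(new int[] {3, 40, 41}, new int[] {1, 40, 900});
        System.out.println(Arrays.toString(mergeWithPriorityQueue(postings)));
        System.out.println(Arrays.toString(collectIntoBitSet(postings, 1000)));
    }
}
```

Both methods return the same sorted, de-duplicated doc IDs. The second one 
avoids the per-document queue maintenance, at the price of needing a set 
that accepts writes in arbitrary order.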

Independently of the benchmarks, I think it's a good simplification to remove 
the constant-score auto rewrite mode.

{noformat}
wikimedium10m.tasks

                   Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                  IntNRQ        8.79      (9.6%)        8.41      (3.4%)   -4.3% ( -15% -    9%)
                  Fuzzy2       60.83     (11.1%)       58.34      (8.7%)   -4.1% ( -21% -   17%)
            OrNotHighMed       98.35     (13.8%)       97.12     (10.9%)   -1.3% ( -22% -   27%)
           OrHighNotHigh       18.88     (13.7%)       18.71     (11.1%)   -0.9% ( -22% -   27%)
           OrNotHighHigh       17.10     (13.4%)       17.03     (11.2%)   -0.4% ( -22% -   27%)
            OrNotHighLow      126.52     (13.6%)      126.85     (10.9%)    0.3% ( -21% -   28%)
               OrHighMed       76.90     (14.0%)       77.25     (11.4%)    0.5% ( -21% -   30%)
            OrHighNotLow       41.29     (14.3%)       41.51     (12.4%)    0.5% ( -22% -   31%)
            OrHighNotMed       57.70     (13.6%)       58.03     (11.6%)    0.6% ( -21% -   29%)
               OrHighLow       73.14     (14.7%)       73.74     (12.0%)    0.8% ( -22% -   32%)
         LowSloppyPhrase      127.22      (8.6%)      128.43      (3.8%)    1.0% ( -10% -   14%)
              OrHighHigh       29.11     (15.1%)       29.41     (12.2%)    1.0% ( -22% -   33%)
        HighSloppyPhrase       12.83     (10.4%)       13.02      (5.3%)    1.4% ( -12% -   19%)
                 Prefix3      113.92      (9.9%)      115.71      (2.4%)    1.6% (  -9% -   15%)
                PKLookup      297.13      (9.2%)      302.03      (3.5%)    1.6% ( -10% -   15%)
         MedSloppyPhrase       38.60      (8.8%)       39.26      (3.7%)    1.7% (  -9% -   15%)
             AndHighHigh       71.39      (6.9%)       72.67      (0.9%)    1.8% (  -5% -   10%)
                HighTerm       87.17      (7.9%)       88.85      (2.1%)    1.9% (  -7% -   12%)
               MedPhrase       74.60      (9.3%)       76.10      (4.3%)    2.0% ( -10% -   17%)
               LowPhrase       21.67      (9.6%)       22.12      (4.0%)    2.1% ( -10% -   17%)
              AndHighMed      297.13      (9.4%)      303.73      (2.1%)    2.2% (  -8% -   15%)
              HighPhrase       16.65      (8.2%)       17.04      (3.7%)    2.3% (  -8% -   15%)
            HighSpanNear        4.56     (10.7%)        4.67      (6.1%)    2.4% ( -12% -   21%)
                 LowTerm      769.53      (7.8%)      788.24      (2.0%)    2.4% (  -6% -   13%)
              AndHighLow      726.76     (10.6%)      744.51      (4.2%)    2.4% ( -11% -   19%)
             MedSpanNear       65.27      (9.3%)       67.00      (3.2%)    2.6% (  -9% -   16%)
                Wildcard      114.28      (9.1%)      118.05      (7.4%)    3.3% ( -12% -   21%)
             LowSpanNear      174.75     (10.3%)      180.83      (3.5%)    3.5% (  -9% -   19%)
                  Fuzzy1       67.63     (11.3%)       70.08      (3.2%)    3.6% (  -9% -   20%)
                 MedTerm      209.00      (9.3%)      216.71      (1.9%)    3.7% (  -6% -   16%)
                 Respell       48.30     (10.6%)       50.58      (1.7%)    4.7% (  -6% -   18%)

low_freq.tasks

                   Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                PKLookup      278.50      (8.8%)      297.48     (13.9%)    6.8% ( -14% -   32%)
                Wildcard      124.50      (7.9%)      250.26     (19.3%)  101.0% (  68% -  139%)
{noformat}

> New DocIdSet implementation with random write access
> ----------------------------------------------------
>
>                 Key: LUCENE-5938
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5938
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>         Attachments: LUCENE-5938.patch, LUCENE-5938.patch, LUCENE-5938.patch, 
> LUCENE-5938.patch, low_freq.tasks
>
>
> We have a great cost API that is supposed to help make decisions about how to 
> best execute queries. However, because several of our filter 
> implementations (e.g. TermsFilter and BooleanFilter) return FixedBitSets, 
> we either use the cost API and make bad decisions, or need to fall back to 
> heuristics which are not as good, such as 
> RandomAccessFilterStrategy.useRandomAccess, which decides that random access 
> should be used if the first doc in the set is less than 100.
> On the other hand, we also have some nice compressed and cacheable DocIdSet 
> implementations, but we cannot make use of them because TermsFilter requires a 
> DocIdSet that has random write access, and FixedBitSet is the only DocIdSet 
> that we have that supports random write access.
> I think it would be nice to replace FixedBitSet in those filters with another 
> DocIdSet that would also support random write access but would have a better 
> cost?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
