[jira] [Commented] (LUCENE-2962) Skip data should be inlined into the postings lists

Han Jiang (JIRA) Tue, 23 Apr 2013 04:43:19 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638971#comment-13638971
 ]


Han Jiang commented on LUCENE-2962:
-----------------------------------

Oh, sorry I didn't make it clear:

All the tests above were already done on wikimediumfull, which is using 
WIKI_MEDIUM_TASKS_10MDOCS_FILE.

The crazyMinShouldMatch tasks benefit a lot from the skipper (as expected 
given the crazy avg_len :) ), and the results are below:

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
       10Terms8High10MSM      322.25      (2.5%)       97.87      (0.9%)  -69.6% ( -71% -  -67%)
       10Terms4High10MSM      449.00      (2.1%)      194.73      (1.2%)  -56.6% ( -58% -  -54%)
       10Terms6High10MSM      611.10      (2.6%)      327.45      (1.4%)  -46.4% ( -49% -  -43%)
       10Terms2High10MSM      614.20      (2.6%)      472.07      (1.9%)  -23.1% ( -26% -  -19%)
        10Terms6High8MSM       61.24      (5.9%)       56.10      (5.6%)   -8.4% ( -18% -    3%)
        10Terms4High6MSM      104.63      (4.9%)      100.22      (5.0%)   -4.2% ( -13% -    5%)
        10Terms4High2MSM        6.31      (7.8%)        6.12      (8.7%)   -3.0% ( -18% -   14%)
        10Terms6High4MSM        1.75      (6.6%)        1.70      (7.3%)   -2.9% ( -15% -   11%)
        10Terms2High4MSM       31.74      (6.5%)       30.85      (7.4%)   -2.8% ( -15% -   11%)
        10Terms2High2MSM        5.30      (7.0%)        5.16      (8.0%)   -2.6% ( -16% -   13%)
        10Terms8High4MSM        0.87      (5.8%)        0.85      (6.3%)   -2.4% ( -13% -   10%)
        10Terms0High8MSM      216.98      (4.1%)      211.76      (4.9%)   -2.4% ( -10% -    6%)
        10Terms6High2MSM        0.92      (5.3%)        0.90      (6.0%)   -2.3% ( -12% -    9%)
        10Terms2High8MSM      115.45      (4.8%)      113.28      (5.1%)   -1.9% ( -11% -    8%)
        10Terms4High8MSM      209.93      (4.4%)      206.04      (4.8%)   -1.9% ( -10% -    7%)
        10Terms8High8MSM       11.03      (6.8%)       10.85      (8.1%)   -1.7% ( -15% -   14%)
        10Terms6High6MSM        9.30      (6.8%)        9.15      (8.0%)   -1.7% ( -15% -   14%)
        10Terms0High2MSM       27.76      (6.9%)       27.30      (8.4%)   -1.6% ( -15% -   14%)
        10Terms4High3MSM        4.34      (7.0%)        4.27      (8.2%)   -1.6% ( -15% -   14%)
        10Terms8High6MSM        3.06      (7.1%)        3.01      (8.3%)   -1.5% ( -15% -   14%)
        10Terms8High2MSM        2.33      (6.5%)        2.30      (7.5%)   -1.2% ( -14% -   13%)
        10Terms4High4MSM        8.77      (6.6%)        8.67      (8.1%)   -1.2% ( -14% -   14%)
        10Terms0High6MSM       77.21      (5.7%)       76.71      (5.9%)   -0.7% ( -11% -   11%)
        10Terms2High6MSM       73.82      (5.7%)       73.40      (6.1%)   -0.6% ( -11% -   11%)
        10Terms0High4MSM       63.80      (5.9%)       63.64      (6.3%)   -0.2% ( -11% -   12%)
       10Terms0High10MSM      595.12      (2.4%)      595.54      (2.4%)    0.1% (  -4% -    5%)
                PKLookup      244.34      (3.1%)      259.97      (3.0%)    6.4% (   0% -   12%)
{noformat}
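
For anyone reading the table, the "Pct diff" column appears to be the relative change of "QPS comp" against "QPS base". That reading is an assumption, but it matches the reported figures for the rows checked here. A minimal sketch follows; the function name `pct_diff` is made up for illustration, and the confidence range in parentheses is not reproduced.

```python
def pct_diff(qps_base, qps_comp):
    """Relative change of the competitor QPS against the baseline, in percent.

    Assumption: this is how the "Pct diff" column is derived; it is not
    taken from the benchmark tool's source.
    """
    return 100.0 * (qps_comp - qps_base) / qps_base


# A few (QPS base, QPS comp) pairs copied from the table above.
rows = {
    "10Terms8High10MSM": (322.25, 97.87),
    "10Terms4High10MSM": (449.00, 194.73),
    "10Terms0High10MSM": (595.12, 595.54),
    "PKLookup": (244.34, 259.97),
}

for task, (base, comp) in rows.items():
    # Negative values mean the competitor is slower than the baseline.
    print(f"{task:>20} {pct_diff(base, comp):6.1f}%")
```

Read this way, the four 10MSM tasks with at least two high-frequency terms are the only ones whose slowdown clearly exceeds the reported StdDev.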
                
> Skip data should be inlined into the postings lists
> ---------------------------------------------------
>
>                 Key: LUCENE-2962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>              Labels: gsoc2013
>         Attachments: proposal.txt
>
>
> Today, we store all skip data as a separate blob at the end of a given term's 
> postings (if that term occurs in enough docs to warrant skip data).
> But this adds overhead during decoding -- we have to seek to a different 
> place for the initial load, we have to init separate readers, we have to seek 
> again while using the lower levels of the skip data, etc.  Also, we have to 
> fully decode all skip information even if we are not going to use it (eg if I 
> only want docIDs, I still must decode position offset and lastPayloadLength).
> If instead we interleaved skip data into the postings file, we could keep it 
> local, and "private" to each file that needs skipping.  This should make it 
> less costly to init and then use the skip data, which'd be a good perf gain 
> for eg PhraseQuery, AndQuery.
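
To make the idea in the description concrete, here is a toy sketch of skip data interleaved with postings. It is not Lucene's on-disk format: the single-level layout, the block size, and the names `build_inlined` and `advance` are all assumptions chosen for illustration. Each block of doc IDs is preceded by a small inline skip entry (the largest doc ID in the block and the block length), so a reader can jump over a whole block without decoding it and without seeking to a separate skip blob.

```python
def build_inlined(doc_ids, block_size=4):
    """Encode sorted doc IDs as one flat stream of blocks.

    Hypothetical layout per block: [last_doc, count, doc, doc, ...]
    where (last_doc, count) is the inline skip entry.
    """
    stream = []
    for i in range(0, len(doc_ids), block_size):
        block = doc_ids[i:i + block_size]
        stream.append(block[-1])   # skip entry: largest doc ID in this block
        stream.append(len(block))  # skip entry: how many postings to jump over
        stream.extend(block)
    return stream


def advance(stream, target):
    """Return (first doc ID >= target or None, number of postings decoded)."""
    pos = 0
    decoded = 0
    while pos < len(stream):
        last_doc, count = stream[pos], stream[pos + 1]
        pos += 2
        if last_doc < target:
            pos += count  # whole block is below target: skip it undecoded
            continue
        # The target, if present, lies in this block: scan it linearly.
        for doc in stream[pos:pos + count]:
            decoded += 1
            if doc >= target:
                return doc, decoded
    return None, decoded


docs = list(range(0, 100, 3))      # 0, 3, 6, ..., 99
stream = build_inlined(docs)
print(advance(stream, 50))         # (51, 2): four blocks skipped, two postings decoded
```

The skip entries sit next to the postings they describe, so a consumer only reads the skip fields it needs from the stream it is already positioned in.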

