[jira] [Commented] (LUCENE-4498) pulse docfreq=1 DOCS_ONLY for 4.1 codec

Michael McCandless (JIRA) Mon, 22 Oct 2012 13:48:13 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481768#comment-13481768 ]


Michael McCandless commented on LUCENE-4498:
--------------------------------------------

Looks good:

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                 Respell       86.70      (3.0%)       84.04      (2.6%)   -3.1% (  -8% -    2%)
               OrHighMed       41.52      (5.8%)       40.44      (6.1%)   -2.6% ( -13% -    9%)
               OrHighLow       25.43      (6.0%)       24.77      (6.4%)   -2.6% ( -14% -   10%)
              OrHighHigh        9.38      (5.9%)        9.15      (6.4%)   -2.5% ( -14% -   10%)
                Wildcard       93.94      (4.1%)       92.36      (2.0%)   -1.7% (  -7% -    4%)
                 MedTerm      211.10     (12.3%)      208.78     (13.4%)   -1.1% ( -23% -   27%)
                  IntNRQ       10.74     (11.3%)       10.62      (7.8%)   -1.1% ( -18% -   20%)
                HighTerm       25.59     (14.0%)       25.35     (15.0%)   -1.0% ( -26% -   32%)
             MedSpanNear       13.77      (2.3%)       13.68      (1.6%)   -0.7% (  -4% -    3%)
        HighSloppyPhrase        4.09      (5.4%)        4.07      (5.2%)   -0.5% ( -10% -   10%)
            HighSpanNear        6.84      (2.9%)        6.81      (2.1%)   -0.4% (  -5% -    4%)
                 Prefix3       17.81      (5.7%)       17.74      (1.5%)   -0.4% (  -7% -    7%)
                  Fuzzy1       77.54      (2.5%)       77.25      (2.7%)   -0.4% (  -5% -    4%)
              AndHighLow      719.17      (2.7%)      716.49      (2.3%)   -0.4% (  -5% -    4%)
                  Fuzzy2       68.94      (2.4%)       68.69      (2.8%)   -0.4% (  -5% -    5%)
             LowSpanNear       12.89      (1.8%)       12.85      (1.3%)   -0.3% (  -3% -    2%)
         MedSloppyPhrase       29.92      (3.4%)       29.85      (3.4%)   -0.2% (  -6% -    6%)
                 LowTerm      500.58      (5.9%)      500.52      (7.0%)   -0.0% ( -12% -   13%)
         LowSloppyPhrase        9.57      (4.4%)        9.60      (4.3%)    0.4% (  -7% -    9%)
               LowPhrase        9.64      (2.8%)        9.70      (3.0%)    0.7% (  -4% -    6%)
              AndHighMed       86.68      (1.2%)       87.26      (1.2%)    0.7% (  -1% -    3%)
               MedPhrase        7.07      (4.3%)        7.15      (4.6%)    1.1% (  -7% -   10%)
              HighPhrase        4.79      (4.8%)        4.84      (5.6%)    1.1% (  -8% -   12%)
             AndHighHigh       25.81      (1.7%)       26.20      (1.2%)    1.5% (  -1% -    4%)
                PKLookup      193.31      (2.1%)      204.74      (1.6%)    5.9% (   2% -    9%)
{noformat}
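
For reading the table: the Pct diff column appears to be the relative change in QPS from base to comp. The sketch below assumes that definition. The class and method names are hypothetical, and this is not the benchmark tool's own code. Under that assumption it gives about -3.1% for Respell and about 5.9% for PKLookup, matching the reported values.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene or its benchmark tooling.
final class PctDiffSketch {

    // Relative change of the "comp" QPS versus the "base" QPS, in percent.
    // Assumption: this is how the "Pct diff" column is derived.
    static double pctDiff(double qpsBase, double qpsComp) {
        return 100.0 * (qpsComp - qpsBase) / qpsBase;
    }

    public static void main(String[] args) {
        // QPS values copied from the Respell and PKLookup rows above.
        System.out.printf(Locale.ROOT, "Respell  %.1f%%%n", pctDiff(86.70, 84.04));
        System.out.printf(Locale.ROOT, "PKLookup %.1f%%%n", pctDiff(193.31, 204.74));
    }
}
```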

                
> pulse docfreq=1 DOCS_ONLY for 4.1 codec
> ---------------------------------------
>
>                 Key: LUCENE-4498
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4498
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Robert Muir
>         Attachments: LUCENE-4498_lazy.patch, LUCENE-4498.patch, LUCENE-4498.patch
>
>
> We have a pulsing codec, but currently it has some downsides:
> * It's very general, wrapping an arbitrary postingsformat and pulsing
> everything in the postings for an arbitrary docfreq/totalTermFreq cutoff.
> * Reuse is hairy: because it specializes its enums based on these cutoffs,
> when walking through terms (e.g. when merging) there is a lot of
> sophisticated logic to avoid the worst cases where we clone indexinputs
> for tons of terms.
>
> On the other hand, the way the 4.1 codec encodes "primary key" fields is
> pretty silly: we write the docStartFP vlong in the term dictionary metadata,
> which tells us where to seek in the .doc file to read our one lonely vint.
>
> I think it's worth investigating whether, in the DOCS_ONLY docfreq=1 case,
> we can just write the lone doc delta where we would otherwise write
> docStartFP.
> We can avoid the hairy reuse problem too, by supporting this in
> refillDocs() in BlockDocsEnum instead of specializing.
> This would remove the additional seek for "primary key" fields without
> really any of the downsides of pulsing today.
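
The encoding change proposed in the description can be sketched as follows. This is an illustration under stated assumptions, not Lucene's actual term-metadata format or the attached patch. The names SingletonDocSketch, TermMeta, encodeTermMeta and decode are hypothetical, the metadata is reduced to two fields, and a hand-rolled variable-length integer stands in for the codec's vint/vlong writers. Since the term has exactly one document, the "doc delta" is taken to be the doc ID itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not Lucene's actual term-metadata layout.
final class SingletonDocSketch {

    // Decoded metadata; a field that does not apply to the term is set to -1.
    static final class TermMeta {
        final int docFreq;
        final long docStartFP;
        final int singletonDocID;

        TermMeta(int docFreq, long docStartFP, int singletonDocID) {
            this.docFreq = docFreq;
            this.docStartFP = docStartFP;
            this.singletonDocID = singletonDocID;
        }
    }

    // Variable-length encoding of a non-negative value:
    // 7 payload bits per byte, high bit set when more bytes follow.
    static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7FL) | 0x80L));
            v >>>= 7;
        }
        out.write((int) v);
    }

    static long readVLong(ByteArrayInputStream in) {
        long v = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) {
                throw new IllegalStateException("unexpected end of metadata");
            }
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }

    // docFreq == 1: store the lone doc ID in the slot that would otherwise
    // hold docStartFP. Otherwise store the file pointer as before.
    static byte[] encodeTermMeta(int docFreq, long docStartFP, int singletonDocID) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVLong(out, docFreq);
        writeVLong(out, docFreq == 1 ? singletonDocID : docStartFP);
        return out.toByteArray();
    }

    static TermMeta decode(byte[] meta) {
        ByteArrayInputStream in = new ByteArrayInputStream(meta);
        int docFreq = (int) readVLong(in);
        long slot = readVLong(in);
        return docFreq == 1
            ? new TermMeta(docFreq, -1L, (int) slot)  // no .doc seek needed
            : new TermMeta(docFreq, slot, -1);        // seek to slot in .doc
    }

    public static void main(String[] args) {
        TermMeta pk = decode(encodeTermMeta(1, 1_000_000L, 42));
        System.out.println("singleton docID = " + pk.singletonDocID);
        TermMeta multi = decode(encodeTermMeta(3, 1_000_000L, -1));
        System.out.println("docStartFP = " + multi.docStartFP);
    }
}
```

In the docFreq == 1 branch the reader already holds the document ID after decoding the term dictionary entry, so it never has to seek into the .doc file. That is the saving the description is after for "primary key" fields.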

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

