64, etc.)

Han Jiang (JIRA) Tue, 07 Aug 2012 20:35:17 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396987#comment-13396987 ]


Han Jiang edited comment on LUCENE-3892 at 8/8/12 3:34 AM:
-----------------------------------------------------------

Oh, thank you Mike! I haven't thought too much about those skipping policies.

bq. Up above, in ForFactory, when we readInt() to get numBytes ... it seems 
like we could stuff the header numBits into that same int and save checking 
that in FORUtil.decompress....
Ah, yes, I just forgot to remove the redundant code. Here is an initial attempt to 
remove the header and call ForDecompressImpl directly in readBlock(), with For, 
blockSize=128. Figures in parentheses show the prior benchmark.
{noformat}
                Task    QPS Base StdDev Base     QPS For  StdDev For      Pct diff
              Phrase        4.99        0.37        3.57        0.26  -38% -  -17% (-44% -  -18%)
          AndHighMed       28.91        2.17       22.66        0.82  -29% -  -12% (-38% -   -9%)
            SpanNear        2.72        0.14        2.22        0.13  -26% -   -8% (-36% -   -8%)
        SloppyPhrase        4.24        0.26        3.70        0.16  -21% -   -3% (-33% -   -6%)
             Respell       40.71        2.59       37.66        1.36  -16% -    2% (-18% -    0%)
              Fuzzy1       43.22        2.01       40.66        0.32  -10% -    0% (-12% -    0%)
              Fuzzy2       16.25        0.90       15.64        0.26  -10% -    3% (-12% -    3%)
            Wildcard       19.07        0.86       19.07        0.73   -8% -    8% (-21% -    3%)
         AndHighHigh        7.76        0.47        7.77        0.15   -7% -    8% (-21% -   10%)
            PKLookup       87.50        4.56       88.51        1.24   -5% -    8% ( -2% -    5%)
        TermBGroup1M       20.42        0.87       21.32        0.74   -3% -   12% (  2% -   10%)
           OrHighMed        5.33        0.68        5.61        0.14   -9% -   23% (-16% -   25%)
          OrHighHigh        4.43        0.53        4.69        0.12   -8% -   23% (-15% -   24%)
         TermGroup1M       13.30        0.34       14.31        0.40    2% -   13% (  0% -   13%)
      TermBGroup1M1P       20.92        0.59       23.71        0.86    6% -   20% ( -1% -   22%)
             Prefix3       30.30        1.41       35.14        1.76    5% -   27% (-14% -   21%)
              IntNRQ        3.90        0.54        4.58        0.47   -7% -   50% (-25% -   33%)
                Term       42.17        1.55       52.33        2.57   13% -   35% (  1% -   33%)
{noformat}
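
The message does not define the Pct diff column. The rows look consistent with a pessimistic-to-optimistic range built from the QPS and StdDev columns, with the result truncated to a whole percent. That reading is an inference from the numbers in the table, not a statement of how the benchmark tool computes it, and the class and method names below are made up for illustration. A minimal Java sketch under that assumption:

```java
public final class PctDiffRange {
    /** Pessimistic end: slowest plausible candidate against fastest plausible baseline. */
    static int worstCase(double baseQps, double baseStdDev, double candQps, double candStdDev) {
        double ratio = (candQps - candStdDev) / (baseQps + baseStdDev);
        return (int) (100.0 * (ratio - 1.0)); // cast truncates toward zero
    }

    /** Optimistic end: fastest plausible candidate against slowest plausible baseline. */
    static int bestCase(double baseQps, double baseStdDev, double candQps, double candStdDev) {
        double ratio = (candQps + candStdDev) / (baseQps - baseStdDev);
        return (int) (100.0 * (ratio - 1.0));
    }

    public static void main(String[] args) {
        // Phrase row from the table: base 4.99 +/- 0.37, For 3.57 +/- 0.26.
        int worst = worstCase(4.99, 0.37, 3.57, 0.26);
        int best = bestCase(4.99, 0.37, 3.57, 0.26);
        System.out.printf("Phrase: %d%% - %d%%%n", worst, best);
    }
}
```

Because the table shows inputs rounded to two decimals, a row whose true value sits close to a whole percent may come out one point off when recomputed this way.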
-The improvement is quite general. However, I still suppose this just benefits 
from fewer method calls. I'm trying to change the PFor code and remove those 
nested calls.- (This is not actually true, since I was comparing percentage 
diffs instead of QPS.)
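
The suggestion quoted earlier, putting numBits into the same int that already carries numBytes, could look roughly like the following. This is only a sketch: the bit layout (low 6 bits for numBits, the rest for numBytes) and the names `ForHeader`, `encode`, `numBytes` and `numBits` are assumptions for illustration, not the layout used in any of the attached patches.

```java
public final class ForHeader {
    // Assumed layout: 6 low bits hold numBits (0..32), the remaining bits hold numBytes.
    private static final int NUM_BITS_WIDTH = 6;
    private static final int NUM_BITS_MASK = (1 << NUM_BITS_WIDTH) - 1;

    /** Packs both fields into one int so a single readInt() recovers them. */
    static int encode(int numBytes, int numBits) {
        if (numBits < 0 || numBits > 32) {
            throw new IllegalArgumentException("numBits out of range: " + numBits);
        }
        if (numBytes < 0 || numBytes > (Integer.MAX_VALUE >>> NUM_BITS_WIDTH)) {
            throw new IllegalArgumentException("numBytes out of range: " + numBytes);
        }
        return (numBytes << NUM_BITS_WIDTH) | numBits;
    }

    static int numBytes(int header) {
        return header >>> NUM_BITS_WIDTH;
    }

    static int numBits(int header) {
        return header & NUM_BITS_MASK;
    }

    public static void main(String[] args) {
        // A block of 128 values at 7 bits each occupies 128 * 7 / 8 = 112 bytes.
        int header = encode(112, 7);
        System.out.println(numBytes(header) + " bytes, " + numBits(header) + " bits per value");
    }
}
```

With numBits known right after the single readInt(), the reader can dispatch straight to the decoder for that width, so the decoder itself no longer needs to parse a header.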

bq. Get more direct access to the file as an int[]; ...
Ok, this will be considered when the pfor+pulsing is completed. I'm just 
curious why we don't have readInts in ora.util yet...

bq. Skipping: can we partially decode a block? ...
The pfor-opt approach (encode the lower bits of each exception in the normal 
area, and the other bits in the exception area) naturally fits "partially 
decode a block"; that will be possible when we optimize skipping queries.
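
To make the split concrete, here is a minimal sketch of the idea in the parenthetical above: every slot keeps its low numBits in the normal area, and only values that overflow store their remaining high bits, with their position, in the exception area. The class `PforOptSketch` and its fields are hypothetical, the areas are plain int arrays rather than bit-packed bytes, and this is not the encoding used in the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

public final class PforOptSketch {
    final int numBits;
    final int[] lowBits;      // "normal area": low numBits of every value
    final int[] excPositions; // "exception area": slots whose value overflowed
    final int[] excHighBits;  // remaining high bits of those values

    PforOptSketch(int[] values, int numBits) {
        if (numBits < 1 || numBits > 31) {
            throw new IllegalArgumentException("numBits must be in 1..31");
        }
        this.numBits = numBits;
        int mask = (1 << numBits) - 1;
        lowBits = new int[values.length];
        List<Integer> positions = new ArrayList<>();
        List<Integer> highs = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            lowBits[i] = values[i] & mask;
            int high = values[i] >>> numBits;
            if (high != 0) { // value does not fit in numBits: record an exception
                positions.add(i);
                highs.add(high);
            }
        }
        excPositions = positions.stream().mapToInt(Integer::intValue).toArray();
        excHighBits = highs.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Full decode: copy the normal area, then patch in the exception high bits. */
    int[] decode() {
        int[] out = lowBits.clone();
        for (int k = 0; k < excPositions.length; k++) {
            out[excPositions[k]] |= excHighBits[k] << numBits;
        }
        return out;
    }
}
```

In this layout the normal area has a fixed width per slot, so a reader can jump to any slot without walking a chain of exceptions; how much that helps skipping in practice is not measured here.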
                
                  
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
> Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-BlockTermScorer.patch, 
> LUCENE-3892-blockFor&hardcode(base).patch, 
> LUCENE-3892-blockFor&packedecoder(comp).patch, 
> LUCENE-3892-blockFor-with-packedints-decoder.patch, 
> LUCENE-3892-blockFor-with-packedints-decoder.patch, 
> LUCENE-3892-blockFor-with-packedints.patch, 
> LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-for&pfor-with-javadoc.patch, 
> LUCENE-3892-handle_open_files.patch, 
> LUCENE-3892-pfor-compress-iterate-numbits.patch, 
> LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for_byte[].patch, 
> LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, 
> LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
> LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.

