[jira] [Commented] (LUCENE-4702) Terms dictionary compression

Adrien Grand (Jira) Fri, 03 Jan 2020 01:37:27 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007358#comment-17007358
 ]


Adrien Grand commented on LUCENE-4702:
--------------------------------------

Indeed, I had disabled PKLookup for another benchmark and forgot to enable it 
again. Let me run the benchmark now. The slowdown is going to depend on the 
format of the ids. For random binary ids like UUIDs, I expect suffixes won't be 
compressed, so the lookup rate should remain almost the same. For compressible 
ids like Flake ids, which likely get compressed with LZ4, there will be a 
noticeable slowdown. Ids of the nightly benchmarks are base36 integers, so they 
should compress with the new "lowercase ascii" compression that was introduced 
in this change, which should give a slowdown, but not as much as with LZ4.
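
To illustrate why the id format matters, here is a minimal Python sketch. It is not Lucene code: `to_base36` is a hypothetical helper, zlib stands in for LZ4 (which is not in the Python standard library), and the id counts and sizes are arbitrary assumptions. It only shows that random bytes do not compress, while base36 ids draw from a 36-symbol lowercase alphabet and so leave room for compression.

```python
import math
import random
import zlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Hypothetical helper: encode a non-negative integer in base 36."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def zlib_ratio(data: bytes) -> float:
    """Compressed size divided by original size (zlib as a stand-in for LZ4)."""
    return len(zlib.compress(data)) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable

# 1000 random 16-byte ids, similar in spirit to binary UUIDs.
random_ids = b"".join(
    bytes(rng.getrandbits(8) for _ in range(16)) for _ in range(1000)
)

# 1000 random integers below 36**6, rendered as base36 strings.
base36_ids = "".join(
    to_base36(rng.randrange(36**6)) for _ in range(1000)
).encode("ascii")

# Each base36 character carries at most log2(36) bits out of the 8 it occupies.
bits_per_char = math.log2(36)

print(f"random bytes ratio: {zlib_ratio(random_ids):.2f}")
print(f"base36 ids ratio:   {zlib_ratio(base36_ids):.2f}")
print(f"base36 lower bound: {bits_per_char / 8:.2f}")
```

The lower bound printed last is the best ratio any byte-oriented scheme could reach on uniformly random base36 characters; real ids with shared structure may compress further.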

Here are the stats:
{noformat}
  index FST:
    72 bytes
  terms:
    6647577 terms
    39885462 bytes (6.0 bytes/term)
  blocks:
    189932 blocks
    184655 terms-only blocks
    5277 sub-block-only blocks
    0 mixed blocks
    0 floor blocks
    189932 non-floor blocks
    0 floor sub-blocks
    14059998 term suffix bytes before compression (69.0 suffix-bytes/block)
    13110492 compressed term suffix bytes (0.93 compression ratio - compression count by algorithm: LOWERCASE_ASCII: 189932)
    6647577 term stats bytes (35.0 stats-bytes/block)
    32616851 other bytes (171.7 other-bytes/block)
    by prefix length:
       0: 1
       2: 4
       3: 143
       4: 5129
       5: 184655
{noformat}
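
Several of the derived figures in the stats above can be cross-checked with simple divisions. The sketch below is only an illustration: the input numbers are copied from the stats, and the variable names are chosen for readability rather than taken from Lucene.

```python
# Figures copied from the stats above; variable names are illustrative.
terms = 6647577
term_bytes = 39885462
blocks = 189932
suffix_bytes_before = 14059998
suffix_bytes_after = 13110492
stats_bytes = 6647577
blocks_by_prefix_length = {0: 1, 2: 4, 3: 143, 4: 5129, 5: 184655}

bytes_per_term = term_bytes / terms
compression_ratio = suffix_bytes_after / suffix_bytes_before
stats_bytes_per_block = stats_bytes / blocks
saved_bytes = suffix_bytes_before - suffix_bytes_after

print(f"{bytes_per_term:.1f} bytes/term")
print(f"{compression_ratio:.2f} compression ratio")
print(f"{stats_bytes_per_block:.1f} stats-bytes/block")
print(f"{saved_bytes} suffix bytes saved by compression")

# The per-prefix-length block counts should add up to the total block count.
print(sum(blocks_by_prefix_length.values()) == blocks)
```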

Compression saves little (ratio 0.93) because we spend as many bytes storing 
suffix lengths (though most of them are the same! we should fix it) as storing 
the actual suffixes. Here are the benchmark results:

{noformat}
                   Task QPS baseline      StdDev   QPS patch      StdDev  Pct diff
                 Respell      188.47      (6.5%)      154.44      (4.8%)    -18.1% ( -27% -   -7%)
                Wildcard      145.26      (4.7%)      132.99      (3.4%)     -8.4% ( -15% -    0%)
                PKLookup      201.32      (3.2%)      187.85      (2.8%)     -6.7% ( -12% -    0%)
                  Fuzzy2       48.07      (3.6%)       46.37      (2.9%)     -3.5% (  -9% -    3%)
                    Term     1347.18      (4.5%)     1327.14      (4.3%)     -1.5% (  -9% -    7%)
                 Prefix3       58.29      (6.3%)       57.53      (6.4%)     -1.3% ( -13% -   12%)
            TermBGroup1M       45.57      (6.0%)       45.08      (6.6%)     -1.1% ( -12% -   12%)
             TermGroup1M       28.98      (6.6%)       28.79      (7.0%)     -0.7% ( -13% -   13%)
        AndMedOrHighHigh       34.19      (3.4%)       33.98      (3.3%)     -0.6% (  -7% -    6%)
               OrHighMed       43.14      (3.1%)       42.96      (3.1%)     -0.4% (  -6% -    6%)
                  Phrase       24.08      (4.3%)       23.99      (4.3%)     -0.4% (  -8% -    8%)
                  Fuzzy1       44.45      (2.7%)       44.32      (2.4%)     -0.3% (  -5% -    4%)
       TermDayOfYearSort       67.17      (6.3%)       66.99      (6.3%)     -0.3% ( -12% -   13%)
              AndHighMed       39.44      (4.1%)       39.33      (4.1%)     -0.3% (  -8% -    8%)
             AndHighHigh       35.90      (4.3%)       35.82      (4.1%)     -0.2% (  -8% -    8%)
            TermGroup10K       45.35      (6.2%)       45.30      (6.5%)     -0.1% ( -12% -   13%)
        IntervalsOrdered        0.98      (8.1%)        0.98      (8.0%)     -0.1% ( -14% -   17%)
          TermBGroup1M1P       31.02      (3.4%)       31.02      (3.4%)     -0.0% (  -6% -    7%)
                  IntNRQ      125.94      (7.6%)      125.93      (6.1%)     -0.0% ( -12% -   14%)
         AndHighOrMedMed       37.36      (2.1%)       37.37      (1.8%)      0.0% (  -3% -    4%)
            TermGroup100       25.90      (2.6%)       25.92      (2.8%)      0.1% (  -5% -    5%)
              OrHighHigh       12.01      (2.8%)       12.02      (2.8%)      0.1% (  -5% -    5%)
            SloppyPhrase        2.77      (1.3%)        2.78      (1.4%)      0.2% (  -2% -    2%)
              TermDTSort       68.26      (4.2%)       68.39      (3.5%)      0.2% (  -7% -    8%)
                SpanNear       10.99      (1.3%)       11.02      (1.5%)      0.2% (  -2% -    3%)
           TermMonthSort       40.55      (2.4%)       40.69      (1.4%)      0.3% (  -3% -    4%)
           TermTitleSort       39.74      (2.1%)       39.92      (1.2%)      0.4% (  -2% -    3%)
{noformat}
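
The "Pct diff" column appears to be the relative change in QPS from baseline to patch. A minimal sketch, assuming that definition (`pct_diff` is a hypothetical helper, and the inputs are the rounded QPS values copied from the table above):

```python
def pct_diff(baseline_qps: float, patch_qps: float) -> float:
    """Relative QPS change from baseline to patch, in percent (assumed definition)."""
    return 100.0 * (patch_qps - baseline_qps) / baseline_qps


# (baseline QPS, patch QPS) pairs copied from the table above.
rows = {
    "Respell": (188.47, 154.44),
    "Wildcard": (145.26, 132.99),
    "PKLookup": (201.32, 187.85),
}

for task, (baseline, patch) in rows.items():
    print(f"{task:>10} {pct_diff(baseline, patch):6.1f}%")
```

Rounded to one decimal, these reproduce the -18.1%, -8.4% and -6.7% reported for the three tasks.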

> Terms dictionary compression
> ----------------------------
>
>                 Key: LUCENE-4702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>         Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> I've done a quick test with the block tree terms dictionary by replacing the 
> call to IndexOutput.writeBytes that writes suffix bytes with a call to 
> LZ4.compressHC to test the performance hit. Interestingly, search performance 
> was very good (see comparison table below) and the tim files were 14% smaller 
> (from 150432 bytes overall to 129516).
> {noformat}
>                    Task QPS baseline      StdDev QPS compressed      StdDev  Pct diff
>                   Fuzzy1      111.50      (2.0%)          78.78      (1.5%)    -29.4% ( -32% -  -26%)
>                   Fuzzy2       36.99      (2.7%)          28.59      (1.5%)    -22.7% ( -26% -  -18%)
>                  Respell      122.86      (2.1%)         103.89      (1.7%)    -15.4% ( -18% -  -11%)
>                 Wildcard      100.58      (4.3%)          94.42      (3.2%)     -6.1% ( -13% -    1%)
>                  Prefix3      124.90      (5.7%)         122.67      (4.7%)     -1.8% ( -11% -    9%)
>                OrHighLow      169.87      (6.8%)         167.77      (8.0%)     -1.2% ( -15% -   14%)
>                  LowTerm     1949.85      (4.5%)        1929.02      (3.4%)     -1.1% (  -8% -    7%)
>               AndHighLow     2011.95      (3.5%)        1991.85      (3.3%)     -1.0% (  -7% -    5%)
>               OrHighHigh      155.63      (6.7%)         154.12      (7.9%)     -1.0% ( -14% -   14%)
>              AndHighHigh      341.82      (1.2%)         339.49      (1.7%)     -0.7% (  -3% -    2%)
>                OrHighMed      217.55      (6.3%)         216.16      (7.1%)     -0.6% ( -13% -   13%)
>                   IntNRQ       53.10     (10.9%)          52.90      (8.6%)     -0.4% ( -17% -   21%)
>                  MedTerm      998.11      (3.8%)         994.82      (5.6%)     -0.3% (  -9% -    9%)
>              MedSpanNear       60.50      (3.7%)          60.36      (4.8%)     -0.2% (  -8% -    8%)
>             HighSpanNear       19.74      (4.5%)          19.72      (5.1%)     -0.1% (  -9% -    9%)
>              LowSpanNear      101.93      (3.2%)         101.82      (4.4%)     -0.1% (  -7% -    7%)
>               AndHighMed      366.18      (1.7%)         366.93      (1.7%)      0.2% (  -3% -    3%)
>                 PKLookup      237.28      (4.0%)         237.96      (4.2%)      0.3% (  -7% -    8%)
>                MedPhrase      173.17      (4.7%)         174.69      (4.7%)      0.9% (  -8% -   10%)
>          LowSloppyPhrase      180.91      (2.6%)         182.79      (2.7%)      1.0% (  -4% -    6%)
>                LowPhrase      374.64      (5.5%)         379.11      (5.8%)      1.2% (  -9% -   13%)
>                 HighTerm      253.14      (7.9%)         256.97     (11.4%)      1.5% ( -16% -   22%)
>               HighPhrase       19.52     (10.6%)          19.83     (11.0%)      1.6% ( -18% -   25%)
>          MedSloppyPhrase      141.90      (2.6%)         144.11      (2.5%)      1.6% (  -3% -    6%)
>         HighSloppyPhrase       25.26      (4.8%)          25.97      (5.0%)      2.8% (  -6% -   13%)
> {noformat}
> Only queries which are very terms-dictionary-intensive got a performance hit 
> (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved 
> (surprisingly) well.
> Do you think this is worth exploring?



