[ 
https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018228#comment-17018228
 ] 

Adrien Grand commented on LUCENE-4702:
--------------------------------------

I did some research on what takes up space in the terms dictionary. Suffixes 
take a fair amount of space for text fields, but for ID fields it tends to be 
the stats instead. So I made a couple of changes to also use LZ4 as a form of 
run-length encoding for doc freqs (frequent runs of 1s for IDs, and 
interestingly there are many runs of 1s for the body field of wikibigall too) 
and for suffix lengths, which are also frequently identical, especially for ID 
fields (always the same for UUIDs or flake IDs, and very little variance for 
auto-increment IDs).

Finally, we were also wasting some space with the pulsing optimization: we 
kept writing the deltas of file pointers even though these deltas are almost 
always zero for ID fields, since their postings are written in the terms 
dictionary rather than in the doc file. Compression is significantly better 
now: the size of the tim file goes down by 18%, from 937MB to 767MB. Here are 
the stats for the body and id fields if you are curious:

{code}
"id" field
  index FST:
    72 bytes
  terms:
    6647577 terms
    39885462 bytes (6.0 bytes/term)
  blocks:
    189932 blocks
    184655 terms-only blocks
    5277 sub-block-only blocks
    0 mixed blocks
    0 floor blocks
    189932 non-floor blocks
    0 floor sub-blocks
    14059850 term suffix bytes before compression (52.8 suffix-bytes/block)
    10023973 compressed term suffix bytes (0.71 compression ratio - compression count by algorithm: NO_COMPRESSION: 189932)
    6647577 term stats bytes before compression (11.7 stats-bytes/block)
    2226414 compressed term stats bytes (0.33 compression ratio)
    26962631 other bytes (142.0 other-bytes/block)
{code}

{code}
"body" field
  index FST:
    72 bytes
  terms:
    46916528 terms
    595069147 bytes (12.7 bytes/term)
  blocks:
    1507239 blocks
    1158537 terms-only blocks
    471 sub-block-only blocks
    348231 mixed blocks
    318391 floor blocks
    491775 non-floor blocks
    1015464 floor sub-blocks
    359880365 term suffix bytes before compression (196.3 suffix-bytes/block)
    295898442 compressed term suffix bytes (0.82 compression ratio - compression count by algorithm: NO_COMPRESSION: 252273, LOWERCASE_ASCII: 1190011, LZ4: 64955)
    94426201 term stats bytes before compression (45.1 stats-bytes/block)
    68022105 compressed term stats bytes (0.72 compression ratio)
    213996755 other bytes (142.0 other-bytes/block)
{code}
 
I see a 10% slowdown on PKLookup that I'll look into.
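
To illustrate why long runs of identical values are so cheap to store, here is 
a minimal sketch in Python. It is not the code from the patch, which relies on 
LZ4 rather than an explicit run-length encoder; the function names and the 
sample doc freqs below are made up for illustration.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    out = []
    for value, length in pairs:
        out.extend([value] * length)
    return out


# Hypothetical doc freqs for one block of an ID field:
# every ID occurs in exactly one document, so the block is a single run of 1s.
id_doc_freqs = [1] * 36

# Hypothetical doc freqs for one block of a text field:
# mostly rare terms (doc freq 1) with a few more frequent terms in between.
body_doc_freqs = [1, 1, 1, 1, 7, 1, 1, 1, 42, 1, 1, 1]

print(rle_encode(id_doc_freqs))    # [(1, 36)]
print(rle_encode(body_doc_freqs))  # [(1, 4), (7, 1), (1, 3), (42, 1), (1, 3)]
```

The ID-field block collapses to a single pair, while the text-field block only 
shrinks where 1s repeat. The same reasoning applies to suffix lengths when all 
terms in a block have the same length.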

> Terms dictionary compression
> ----------------------------
>
>                 Key: LUCENE-4702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>         Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> I've done a quick test with the block tree terms dictionary by replacing the 
> call to IndexOutput.writeBytes that writes suffix bytes with a call to 
> LZ4.compressHC to test the performance hit. Interestingly, search performance 
> was very good (see comparison table below) and the tim files were 14% smaller 
> (from 150432 bytes overall to 129516).
> {noformat}
>             Task  QPS baseline   StdDev  QPS compressed   StdDev  Pct diff
>           Fuzzy1        111.50   (2.0%)           78.78   (1.5%)  -29.4% ( -32% -  -26%)
>           Fuzzy2         36.99   (2.7%)           28.59   (1.5%)  -22.7% ( -26% -  -18%)
>          Respell        122.86   (2.1%)          103.89   (1.7%)  -15.4% ( -18% -  -11%)
>         Wildcard        100.58   (4.3%)           94.42   (3.2%)   -6.1% ( -13% -    1%)
>          Prefix3        124.90   (5.7%)          122.67   (4.7%)   -1.8% ( -11% -    9%)
>        OrHighLow        169.87   (6.8%)          167.77   (8.0%)   -1.2% ( -15% -   14%)
>          LowTerm       1949.85   (4.5%)         1929.02   (3.4%)   -1.1% (  -8% -    7%)
>       AndHighLow       2011.95   (3.5%)         1991.85   (3.3%)   -1.0% (  -7% -    5%)
>       OrHighHigh        155.63   (6.7%)          154.12   (7.9%)   -1.0% ( -14% -   14%)
>      AndHighHigh        341.82   (1.2%)          339.49   (1.7%)   -0.7% (  -3% -    2%)
>        OrHighMed        217.55   (6.3%)          216.16   (7.1%)   -0.6% ( -13% -   13%)
>           IntNRQ         53.10  (10.9%)           52.90   (8.6%)   -0.4% ( -17% -   21%)
>          MedTerm        998.11   (3.8%)          994.82   (5.6%)   -0.3% (  -9% -    9%)
>      MedSpanNear         60.50   (3.7%)           60.36   (4.8%)   -0.2% (  -8% -    8%)
>     HighSpanNear         19.74   (4.5%)           19.72   (5.1%)   -0.1% (  -9% -    9%)
>      LowSpanNear        101.93   (3.2%)          101.82   (4.4%)   -0.1% (  -7% -    7%)
>       AndHighMed        366.18   (1.7%)          366.93   (1.7%)    0.2% (  -3% -    3%)
>         PKLookup        237.28   (4.0%)          237.96   (4.2%)    0.3% (  -7% -    8%)
>        MedPhrase        173.17   (4.7%)          174.69   (4.7%)    0.9% (  -8% -   10%)
>  LowSloppyPhrase        180.91   (2.6%)          182.79   (2.7%)    1.0% (  -4% -    6%)
>        LowPhrase        374.64   (5.5%)          379.11   (5.8%)    1.2% (  -9% -   13%)
>         HighTerm        253.14   (7.9%)          256.97  (11.4%)    1.5% ( -16% -   22%)
>       HighPhrase         19.52  (10.6%)           19.83  (11.0%)    1.6% ( -18% -   25%)
>  MedSloppyPhrase        141.90   (2.6%)          144.11   (2.5%)    1.6% (  -3% -    6%)
> HighSloppyPhrase         25.26   (4.8%)           25.97   (5.0%)    2.8% (  -6% -   13%)
> {noformat}
> Only queries that are very terms-dictionary-intensive took a performance hit 
> (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, 
> behaved (surprisingly) well.
> Do you think this is worth exploring?


