[GitHub] [lucene-solr] jpountz opened a new pull request #1126: LUCENE-5201: Terms dictionary compression.

GitBox Thu, 26 Dec 2019 05:18:25 -0800

jpountz opened a new pull request #1126: LUCENE-5201: Terms dictionary compression.
URL: https://github.com/apache/lucene-solr/pull/1126
 
 
   Compress blocks of suffixes in order to make the terms dictionary more
   space-efficient. Two compression algorithms are available, and whichever
   one is more space-efficient is used:
    - LowercaseAsciiCompression, which applies when all bytes are in the
      `[0x1F,0x3F)` or `[0x5F,0x7F)` ranges, which notably include all digits,
      lowercase ASCII characters, '.', '-' and '_', and encodes 4 chars in 3
      bytes. It is very often applicable to analyzed content and decompresses
      very quickly thanks to auto-vectorization support in the JVM.
    - LZ4, when the compression ratio is less than 0.75.
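
   To make the "4 chars in 3 bytes" idea concrete, here is a minimal sketch
   of 6-bit packing over the two byte ranges listed above. It is an
   illustration only, not the code from this pull request: the class name,
   the method names, the bit layout and the restriction to inputs whose
   length is a multiple of 4 are all assumptions made for the example.

```java
// Hypothetical illustration; names and bit layout are NOT taken from the PR.
final class SixBitPackingSketch {

  // True when b lies in [0x1F,0x3F) or [0x5F,0x7F), the ranges quoted above.
  static boolean isCompressible(int b) {
    return (b >= 0x1F && b < 0x3F) || (b >= 0x5F && b < 0x7F);
  }

  // Maps a compressible byte to a 6-bit code: first range to 0..31, second to 32..63.
  static int toSixBits(int b) {
    return b < 0x3F ? b - 0x1F : b - 0x5F + 32;
  }

  // Inverse of toSixBits.
  static int fromSixBits(int code) {
    return code < 32 ? code + 0x1F : code - 32 + 0x5F;
  }

  // Packs every group of 4 input bytes (4 x 6 bits = 24 bits) into 3 output bytes.
  // Assumption for brevity: the input length is a multiple of 4.
  static byte[] pack(byte[] in) {
    if (in.length % 4 != 0) {
      throw new IllegalArgumentException("length must be a multiple of 4");
    }
    byte[] out = new byte[in.length / 4 * 3];
    for (int i = 0, o = 0; i < in.length; i += 4, o += 3) {
      int bits = 0;
      for (int j = 0; j < 4; j++) {
        int b = in[i + j] & 0xFF;
        if (!isCompressible(b)) {
          throw new IllegalArgumentException("byte out of range: " + b);
        }
        bits = (bits << 6) | toSixBits(b);
      }
      out[o] = (byte) (bits >>> 16);
      out[o + 1] = (byte) (bits >>> 8);
      out[o + 2] = (byte) bits;
    }
    return out;
  }

  // Restores the original bytes: every 3 packed bytes yield 4 output bytes.
  static byte[] unpack(byte[] packed) {
    if (packed.length % 3 != 0) {
      throw new IllegalArgumentException("length must be a multiple of 3");
    }
    byte[] out = new byte[packed.length / 3 * 4];
    for (int i = 0, o = 0; i < packed.length; i += 3, o += 4) {
      int bits = ((packed[i] & 0xFF) << 16)
          | ((packed[i + 1] & 0xFF) << 8)
          | (packed[i + 2] & 0xFF);
      for (int j = 3; j >= 0; j--) {
        out[o + j] = (byte) fromSixBits(bits & 0x3F);
        bits >>>= 6;
      }
    }
    return out;
  }
}
```

   The two ranges together hold 64 values, so each byte fits in 6 bits.
   Packing 4 chars into 3 bytes corresponds to a ratio of 0.75, the same
   figure as the LZ4 threshold quoted above.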
   
   I was a bit unhappy with the complexity of the high-compression LZ4 option,
   so I simplified it in order to only keep the logic that detects duplicate
   strings. The logic about what to do in case overlapping matches are found,
   which was responsible for most of the complexity while only yielding tiny
   benefits, has been removed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
