markharwood commented on issue #1234: Add compression for Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-583449275
 
 
   There was a suggestion from @jimczi that we fall back to writing raw data if content doesn't compress well. I'm not sure this logic is worth developing, for the reasons outlined below:
   
   I wrote a [compression buffer](https://gist.github.com/markharwood/91cc8d96d6611ad97df11f244b1b1d0f) to see what the compression algorithm outputs before deciding whether to write the compressed or raw data to disk.
   I tested with the most uncompressible content I could imagine:
   
        // Fill the buffer with pseudo-random byte values (the least compressible content I could produce).
        public static void fillRandom(byte[] buffer, int length) {
            for (int i = 0; i < length; i++) {
                buffer[i] = (byte) (Math.random() * Byte.MAX_VALUE);
            }
        }
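   
   Feeding a buffer of roughly that size through the check gives a comparison along these lines (again a sketch, reusing `fillRandom` above and the `lz4Compress` placeholder from the earlier snippet):
   
        byte[] data = new byte[96_541];                     // roughly the raw size quoted below
        fillRandom(data, data.length);
        byte[] scratch = new byte[data.length + 1_024];     // headroom for worst-case LZ4 expansion
        int compressedLength = lz4Compress(data, data.length, scratch);
        System.out.println(compressedLength + " compressed vs " + data.length + " raw bytes");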
   
   The LZ4-compressed versions of this content were only marginally bigger than their raw counterparts: 96,921 compressed vs 96,541 raw bytes, i.e. 380 extra bytes, or about 0.4% overhead on the original content.
   On that basis I'm not sure it's worth doubling the memory cost of the indexing logic (we would need a temporary output buffer at least as large as the raw data being compressed) plus the extra byte shuffling.
