markharwood commented on issue #1234: Add compression for Binary doc value 
fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-582367395
 
 
   Thanks for looking at this, Mike.
   
   >LOL, that's crazy -- you should go introduce yourself to the other markh ;)
   
   I already reached out and we're working out the divorce proceedings :)
   
   >@markharwood how can we reproduce these benchmarks? What were the log data 
documents storing as BINARY doc values fields?
   
   These were Elasticsearch log file entries, so each value was a string: either
something short like
`[instance-0000000048] users file [/app/config/users] changed. updating users... )`
or an error with a whole stack trace.
   My test rig is
[here](https://gist.github.com/markharwood/724009754c89e7f245625120e71f60d7) if
you want to try it with some other data files.
   
   >And how can indexing and searching get so much faster when 
compress/decompress is in the path!
   
   This was a test on my MacBook with an SSD and an encrypted filesystem, so
perhaps not the best benchmarking setup. Maybe writing more bytes simply means
more overhead from the OS-level encryption?
   
   >I think our testing of BINARY doc values may not be great ... maybe add a 
randomized test that sometimes stores very compressible and very 
incompressible, large, BINARY doc values?
   
   Will do. @jimczi has suggested adding support for storing without
compression when the content doesn't compress well. I guess that can be a
combination of:
   1) A fast heuristic, e.g. if the max value length across the docs in a
block is <= 2, store without compression, and
   2) "Try it and see" compression: buffer the compressed output in a byte
array and only write the compressed form to disk if it is smaller than the
uncompressed input.
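
The two options above could be combined roughly as in the sketch below. It is only an illustration under stated assumptions: `java.util.zip.Deflater` stands in for whatever codec the patch actually uses, and the class name `BinaryBlockSketch`, the one-byte RAW/COMPRESSED flag and the method signatures are all hypothetical, not Lucene APIs.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch, not the Lucene implementation. */
final class BinaryBlockSketch {
    // Assumed one-byte header telling the reader how the block was stored.
    static final byte RAW = 0;
    static final byte COMPRESSED = 1;

    /** Encodes one block; maxValueLength is the longest value in the block. */
    static byte[] encode(byte[] block, int maxValueLength) {
        // 1) Fast heuristic: tiny values rarely compress, so skip the attempt.
        if (maxValueLength <= 2) {
            return withFlag(RAW, block, block.length);
        }
        // 2) "Try it and see": compress into a buffer no larger than the input.
        Deflater deflater = new Deflater(Deflater.BEST_SPEED); // stand-in codec
        try {
            deflater.setInput(block);
            deflater.finish();
            byte[] buffer = new byte[block.length];
            int n = deflater.deflate(buffer);
            // If the output did not fit, finished() stays false and we keep the raw bytes.
            if (deflater.finished() && n < block.length) {
                return withFlag(COMPRESSED, buffer, n);
            }
            return withFlag(RAW, block, block.length);
        } finally {
            deflater.end();
        }
    }

    /** Reverses encode() by inspecting the flag byte. */
    static byte[] decode(byte[] stored) {
        if (stored[0] == RAW) {
            return Arrays.copyOfRange(stored, 1, stored.length);
        }
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(stored, 1, stored.length - 1);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or corrupt block");
                }
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
    }

    // Prepends the flag byte to the first `length` bytes of data.
    private static byte[] withFlag(byte flag, byte[] data, int length) {
        byte[] out = new byte[length + 1];
        out[0] = flag;
        System.arraycopy(data, 0, out, 1, length);
        return out;
    }
}
```

With this layout an incompressible block costs one extra byte on disk plus one wasted compression attempt at write time, and the heuristic in step 1 avoids even that attempt for blocks of very short values.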
   
   
