markharwood commented on issue #1234: Add compression for Binary doc value fields
URL: https://github.com/apache/lucene-solr/pull/1234#issuecomment-582367395

Thanks for looking at this, Mike.

> LOL, that's crazy -- you should go introduce yourself to the other markh ;)

I already reached out and we're working out the divorce proceedings :)

> @markharwood how can we reproduce these benchmarks? What were the log data documents storing as BINARY doc values fields?

These were Elasticsearch log file entries, so each value was a string which could be something short like `[instance-0000000048] users file [/app/config/users] changed. updating users... )` or an error with a whole stack trace.
My test rig is [here](https://gist.github.com/markharwood/724009754c89e7f245625120e71f60d7) if you want to try with some other data files.

> And how can indexing and searching get so much faster when compress/decompress is in the path!

This was a test on my MacBook with an SSD and an encrypted FS, so perhaps not the best benchmarking setup. Maybe just writing more bytes = more overhead with the OS-level encryption?

> I think our testing of BINARY doc values may not be great ... maybe add a randomized test that sometimes stores very compressible and very incompressible, large, BINARY doc values?

Will do.
@jimczi has suggested adding support for storing without compression when the content doesn't compress well. I guess that can be a combination of:
1) A fast heuristic, e.g. if the max value length for each of the docs in a block is <= 2, then store without compression, and
2) "Try it and see" compression: buffer the compression output to a byte array and only write the compressed form to disk if its size is less than the uncompressed input.
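To make the two ideas above concrete, here is a rough sketch of how the heuristic and the "try it and see" step might combine. It is not the code in this PR: `java.util.zip.Deflater` stands in for whatever compressor the doc values format actually uses, and the class name `BlockCodecSketch`, the one-byte `RAW`/`COMPRESSED` flag, and the `encode`/`decode` signatures are all made up for illustration.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch only: names and on-disk layout are assumptions, not Lucene's format.
final class BlockCodecSketch {
  static final byte RAW = 0;
  static final byte COMPRESSED = 1;

  /** Returns a 1-byte flag followed by either the raw block or its deflated form. */
  static byte[] encode(byte[] block, int maxValueLength) {
    // 1) Fast heuristic: blocks of tiny values are stored without trying to compress.
    if (maxValueLength <= 2) {
      return withFlag(RAW, block, block.length);
    }
    // 2) "Try it and see": compress into a buffer no larger than the input.
    Deflater deflater = new Deflater();
    try {
      deflater.setInput(block);
      deflater.finish();
      byte[] buffer = new byte[block.length];
      int compressedLength = deflater.deflate(buffer);
      // If the output did not fit, or is not strictly smaller, keep the raw bytes.
      if (deflater.finished() && compressedLength < block.length) {
        return withFlag(COMPRESSED, buffer, compressedLength);
      }
      return withFlag(RAW, block, block.length);
    } finally {
      deflater.end();
    }
  }

  /** Reverses encode(); the caller supplies the uncompressed length of the block. */
  static byte[] decode(byte[] stored, int originalLength) {
    if (stored[0] == RAW) {
      return Arrays.copyOfRange(stored, 1, stored.length);
    }
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(stored, 1, stored.length - 1);
      byte[] out = new byte[originalLength];
      int offset = 0;
      while (offset < out.length && !inflater.finished()) {
        int n = inflater.inflate(out, offset, out.length - offset);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new IllegalStateException("truncated or corrupt block");
        }
        offset += n;
      }
      return out;
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt block", e);
    } finally {
      inflater.end();
    }
  }

  private static byte[] withFlag(byte flag, byte[] data, int length) {
    byte[] out = new byte[length + 1];
    out[0] = flag;
    System.arraycopy(data, 0, out, 1, length);
    return out;
  }
}
```

The cost of this approach is that incompressible blocks still pay for one compression attempt at write time; the read path only pays a flag check for them.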