PE (PerformanceEvaluation) has short and unique keys, so any prefix encoding won't buy much (and may even make things worse).
What's interesting to me is the difference between snappy and lzo, I expected them to be mostly equivalent in terms of compression.

So as a general guideline I'd say:
o If you have long keys (compared to the values) or many columns, use a prefix encoder. Only use FAST_DIFF.
o If the values are large (and not precompressed as in images), use a block compressor (SNAPPY, LZO, GZIP, etc)
o Use GZIP for cold data
o Use SNAPPY or LZO for hot data.
o In most cases you do want to enable SNAPPY or LZO by default (low perf overhead + space savings).

-- Lars

________________________________
From: Nick Dimiduk <ndimi...@gmail.com>
To: hbase-dev <dev@hbase.apache.org>
Sent: Wednesday, September 11, 2013 12:10 PM
Subject: Documenting Guidance on compression and codecs

Do we have a consolidated resource with information and recommendations about use of the above? For instance, I ran a simple test using PerformanceEvaluation, examining just the size of data on disk for 1G of input data. The matrix below has some surprising results:

+--------------------+--------------+
| MODIFIER           | SIZE (bytes) |
+--------------------+--------------+
| none               |   1108553612 |
+--------------------+--------------+
| compression:SNAPPY |    427335534 |
+--------------------+--------------+
| compression:LZO    |    270422088 |
+--------------------+--------------+
| compression:GZ     |    152899297 |
+--------------------+--------------+
| codec:PREFIX       |   1993910969 |
+--------------------+--------------+
| codec:DIFF         |   1960970083 |
+--------------------+--------------+
| codec:FAST_DIFF    |   1061374722 |
+--------------------+--------------+
| codec:PREFIX_TREE  |   1066586604 |
+--------------------+--------------+

Where does a wayward soul look for guidance on which combination of the above to choose for their application?

Thanks,
Nick
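
For anyone comparing the numbers in the table above, it can help to express each size as a ratio to the uncompressed baseline. The following is only a rough sketch: the byte counts are copied from the table, while the names `sizes`, `ratio_to_baseline`, and `ratios` are made up for illustration and are not part of HBase or PerformanceEvaluation.

```python
# Sizes on disk in bytes, copied from the table in the original message.
sizes = {
    "none": 1108553612,
    "compression:SNAPPY": 427335534,
    "compression:LZO": 270422088,
    "compression:GZ": 152899297,
    "codec:PREFIX": 1993910969,
    "codec:DIFF": 1960970083,
    "codec:FAST_DIFF": 1061374722,
    "codec:PREFIX_TREE": 1066586604,
}


def ratio_to_baseline(sizes, baseline="none"):
    """Return each size divided by the baseline size (hypothetical helper).

    A value below 1.0 means the modifier saved space; a value above 1.0
    means the data grew on disk.
    """
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}


ratios = ratio_to_baseline(sizes)

# Print the modifiers from smallest to largest on-disk footprint.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:<20} {ratio:5.2f}x of baseline")
```

Read this way, the three block compressors all shrink the data, PREFIX and DIFF grow it, and FAST_DIFF and PREFIX_TREE land just under the baseline, which matches the point about PE's short, unique keys.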