On May 3, 9:38 pm, Paul Rubin <no.em...@nospam.invalid> wrote:
> Steve Howell <showel...@yahoo.com> writes:
> > My test was to write roughly 4GB of data, with 2 million keys of 2k
> > bytes each.
>
> If the records are something like English text, you can compress
> them with zlib and get some compression gain by pre-initializing
> a zlib dictionary from a fixed English corpus, then cloning it.
> That is, if your messages are a couple paragraphs, you might
> say something like:
>
>     iv = (some fixed 20k or so of records concatenated together)
>     compressor = zlib(iv).clone()  # I forget what this
>                                    # operation is actually called
>
>     # I forget what this is called too, but the idea is you throw
>     # away the output of compressing the fixed text, and sync
>     # to a byte boundary
>     compressor.sync()
>
>     zout = compressor.compress(your_record).sync()
>     ...
>
> i.e. the part you save in the file is just the difference between
> compress(corpus) and compress(corpus+record). To decompress, you
> initialize a compressor the same way, etc.
>
> It's been a while since I used that trick, but for JSON records of a
> few hundred bytes, I remember getting around 2:1 compression, while
> starting with an unprepared compressor gave almost no compression.
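For what it's worth, newer versions of Python (3.3+) expose this trick
directly: the zdict argument to zlib.compressobj() and
zlib.decompressobj() preloads the deflate window with a shared
dictionary, so the clone/sync dance above collapses to a few lines.
A minimal sketch (CORPUS and sample_records.txt are stand-ins for
whatever fixed sample of your own records you pick; only the last 32k
matters, since that is the size of the deflate window):

    import zlib

    # Stand-in: a fixed sample of representative records, concatenated.
    # Compressor and decompressor must use byte-identical copies.
    CORPUS = open("sample_records.txt", "rb").read()

    def compress_record(record):
        # zdict preloads the deflate window with the corpus, so small
        # records can back-reference substrings they share with it.
        c = zlib.compressobj(level=9, zdict=CORPUS)
        return c.compress(record) + c.flush()

    def decompress_record(blob):
        d = zlib.decompressobj(zdict=CORPUS)
        return d.decompress(blob) + d.flush()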
Sounds like a useful technique. The text snippets that I'm compressing
are indeed mostly English words, and 7-bit ASCII, so it would be
practical to use a compression library that applies the same
good-enough encoding every time, so the encoding dictionary doesn't
have to be written out as part of every small payload. Sort of as you
suggest, you could build a Huffman encoding from a representative run
of data, save that tree off somewhere, and then use it for all future
encoding and decoding. Is there a name for this technique?
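To sketch that idea concretely (my own illustration, not any
particular library's API): build the code table once from a
representative sample, save it out of band, and reuse it for every
payload. Here the sample is padded with one of each 7-bit value so
bytes that never appear in the sample still get a code:

    import heapq
    from collections import Counter

    def build_code_table(sample):
        # Pad with one of each 7-bit value so every possible input
        # byte gets *some* code, even if absent from the sample.
        freq = Counter(sample + bytes(range(128)))
        # Heap entries: (count, unique tiebreaker, tree); a tree is
        # either a symbol (an int) or a (left, right) pair.
        heap = [(n, i, sym) for i, (sym, n) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            n1, _, t1 = heapq.heappop(heap)
            n2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (n1 + n2, tiebreak, (t1, t2)))
            tiebreak += 1
        codes = {}
        def walk(tree, prefix=""):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"
        walk(heap[0][2])
        return codes

    def encode(record, codes):
        return "".join(codes[b] for b in record)

    def decode(bits, codes):
        # Huffman codes are prefix-free, so greedy matching is safe.
        inverse = dict((code, sym) for sym, code in codes.items())
        out, cur = bytearray(), ""
        for bit in bits:
            cur += bit
            if cur in inverse:
                out.append(inverse[cur])
                cur = ""
        return bytes(out)

With that, decode(encode(record, codes), codes) == record for any
7-bit record; a real implementation would pack the bit-string into
bytes rather than keep one character per bit.

--
http://mail.python.org/mailman/listinfo/python-list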