It's the same approach to variable-length encoding, yes. Zig-zag is a trick to make negative numbers "compatible" with this encoding. Because two's-complement negative numbers start with a bunch of 1s their representation is terrible under this variable-length encoding -- always of maximum length. Zig-zag defines a simple two-way mapping between all integers and the nonnegative integers, so you can encode even negative values as positive values, and encode those positive values effectively.
I think you're touching on a good point -- that the logical conclusion of these compression ideas is yet further out. And, the best encoding probably does about as well as a good general algorithm like LZ{W,O}. And one can already tell Hadoop to write out compressed sequence files. So, why not just leverage that? My gut says it's not a step too far to implement variable-length encoding. It's a transformation that mostly chops out zero bytes, which only aids later compression. It's not as if it's a transformation that then compresses worse. Once I get done with MAHOUT-302 (going to be another classic doozy of a patch from me) I am happy to look into the real effects of this. On Sun, May 2, 2010 at 8:45 PM, Drew Farris <drew.far...@gmail.com> wrote: > Is this what is commonly referred to as zig-zag encoding? Avro uses the same > technique: > http://hadoop.apache.org/avro/docs/1.3.2/spec.html#binary_encoding > > For sequential sparse vectors it we could get an additional win by delta > encoding the indexes. This would allow the index, stored as the difference > from the previous index, to be kept to two bytes in many cases. > > Regardless, vint encoding will produce a significant space savings and > Sean's right: it has also been my experience that space savings often trump > speed simply because of the speed of network or storage. > > Do anyone have any idea whether greater gains to be found by finely tuning > the base encoding vs. relying on some form of SequenceFile block > compression? (or do both approaches compliment each other nicely?)