It's the same approach to variable-length encoding, yes. Zig-zag is a
trick to make negative numbers "compatible" with this encoding.
Because two's-complement negative numbers start with a bunch of 1s
their representation is terrible under this variable-length encoding
-- always of maximum length. Zig-zag defines a simple two-way mapping
between all integers and the nonnegative integers, so you can encode
even negative values as positive values, and encode those positive
values effectively.

I think you're touching on a good point -- that the logical conclusion
of these compression ideas is yet further out. And, the best encoding
probably does about as well as a good general algorithm like LZ{W,O}.
And one can already tell Hadoop to write out compressed sequence
files. So, why not just leverage that?

My gut says it's not a step too far to implement variable-length
encoding. It's a transformation that mostly chops out zero bytes,
which only aids later compression. It's not as if it's a
transformation that then compresses worse.

Once I get done with MAHOUT-302 (going to be another classic doozy of
a patch from me) I am happy to look into the real effects of this.

On Sun, May 2, 2010 at 8:45 PM, Drew Farris <drew.far...@gmail.com> wrote:
> Is this what is commonly referred to as zig-zag encoding? Avro uses the same
> technique:
> http://hadoop.apache.org/avro/docs/1.3.2/spec.html#binary_encoding
>
> For sequential sparse vectors it we could get an additional win by delta
> encoding the indexes. This would allow the index, stored as the difference
> from the previous index, to be kept to two bytes in many cases.
>
> Regardless, vint encoding will produce a significant space savings and
> Sean's right: it has also been my experience that space savings often trump
> speed simply because of the speed of network or storage.
>
> Do anyone have any idea whether greater gains to be found by finely tuning
> the base encoding vs. relying on some form of SequenceFile block
> compression? (or do both approaches compliment each other nicely?)

Reply via email to