Re: VectorWritable bug et al

Ted Dunning Thu, 11 Feb 2010 12:10:00 -0800

On Thu, Feb 11, 2010 at 11:51 AM, Sean Owen <sro...@gmail.com> wrote:

> On Thu, Feb 11, 2010 at 6:37 PM, Jake Mannix <jake.man...@gmail.com>
> wrote:
> > Why would the sparse representation be the only way to represent it
> > on disk?  It's nearly twice as big as the dense form for dense vectors
> > (ok, 50% bigger).
>
> On disk (well, in any serialized form) you just have key-value,
> key-value pairs in sequence, right? Access time is irrelevant, so this
> representation is most space-efficient. Why's it bigger?
>

Because in a dense matrix or vector, you don't need the keys.  Keeping both
keys and values makes the representation larger.
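
To put rough numbers on that, here is a minimal sketch. It assumes 4-byte int keys and 8-byte double values written with Python's struct module; the function names are made up for illustration, and this is not Mahout's actual VectorWritable format.

```python
import struct


def serialize_dense(values):
    """Length prefix followed by the values only; positions are implicit."""
    return struct.pack(">i", len(values)) + struct.pack(f">{len(values)}d", *values)


def serialize_sparse(values):
    """Entry count followed by (index, value) pairs for the non-zero entries."""
    entries = [(i, v) for i, v in enumerate(values) if v != 0.0]
    parts = [struct.pack(">i", len(entries))]
    for i, v in entries:
        # 4-byte key + 8-byte value = 12 bytes per entry (no padding with ">")
        parts.append(struct.pack(">id", i, v))
    return b"".join(parts)


# A fully dense vector: every entry is non-zero.
dense_vector = [float(i + 1) for i in range(1000)]
dense_bytes = serialize_dense(dense_vector)
sparse_bytes = serialize_sparse(dense_vector)
```

For this 1000-element vector the dense form takes 8004 bytes and the key-value form takes 12004 bytes, which is the "50% bigger" figure from earlier in the thread.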

Moreover, specialized matrices can be stored even more cheaply.  For
instance, a dense integer matrix could use integer serialization with no
keys, which gives very compact storage with systems like Avro.

A sparse binary matrix can be kept using just a delta-encoded list of keys.
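
As a rough illustration of the delta-encoded form, the sketch below stores one row of a sparse binary matrix as the gaps between sorted column indices. It assumes each gap is written as a variable-length unsigned integer, 7 bits per byte. The function names are hypothetical, and this is plain Python rather than Avro's own encoder.

```python
def write_varint(n, out):
    """Append a non-negative int to out, 7 bits per byte, low bits first."""
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return


def encode_binary_row(indices):
    """Encode sorted, distinct column indices as a count plus gaps."""
    out = bytearray()
    write_varint(len(indices), out)
    prev = 0
    for idx in indices:
        write_varint(idx - prev, out)  # store only the distance to the previous key
        prev = idx
    return bytes(out)


def decode_binary_row(data):
    """Inverse of encode_binary_row."""
    pos = 0

    def read_varint():
        nonlocal pos
        result = 0
        shift = 0
        while True:
            b = data[pos]
            pos += 1
            result |= (b & 0x7F) << shift
            if not b & 0x80:
                return result
            shift += 7

    count = read_varint()
    indices = []
    prev = 0
    for _ in range(count):
        prev += read_varint()
        indices.append(prev)
    return indices


row = [3, 10, 11, 500, 100000]
encoded = encode_binary_row(row)
```

No values are written at all, since every stored key implies a 1. The five indices above take 9 bytes, against 20 bytes for five raw 4-byte ints.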

There is no single representation that is best, but Avro should make it
relatively easy to choose the on-disk representation at write time.
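
One way to picture the write-time choice is a one-byte tag in front of each vector, so the writer can pick whichever encoding is smaller and the reader can dispatch on the tag. The sketch below is plain Python with struct, not Avro (where a union schema would presumably play the role of the tag). The tag values, the layout, and the function names are all assumptions made for illustration.

```python
import struct

DENSE, SPARSE = 0, 1  # hypothetical tag values


def write_vector(values):
    """Serialize with whichever layout is smaller for this particular vector."""
    nonzero = [(i, v) for i, v in enumerate(values) if v != 0.0]
    dense_size = 8 * len(values)     # values only
    sparse_size = 12 * len(nonzero)  # 4-byte key + 8-byte value per entry
    if sparse_size < dense_size:
        body = b"".join(struct.pack(">id", i, v) for i, v in nonzero)
        return struct.pack(">bii", SPARSE, len(values), len(nonzero)) + body
    return struct.pack(">bi", DENSE, len(values)) + struct.pack(
        f">{len(values)}d", *values
    )


def read_vector(data):
    """Read back either layout by dispatching on the leading tag byte."""
    tag, size = struct.unpack_from(">bi", data, 0)
    if tag == DENSE:
        return list(struct.unpack_from(f">{size}d", data, 5))
    (count,) = struct.unpack_from(">i", data, 5)
    values = [0.0] * size
    for k in range(count):
        i, v = struct.unpack_from(">id", data, 9 + 12 * k)
        values[i] = v
    return values


mostly_zero = [0.0] * 100
mostly_zero[42] = 7.5
packed_sparse = write_vector(mostly_zero)
packed_dense = write_vector([1.0, 2.0, 3.0])
```

The reader never needs to know in advance which form was used, so the decision stays local to the writer and can differ from one vector to the next.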

In any case, statics are evil.  Slightly less evil than true global
variables, but not all that far behind.  If we can do without, we should.
