On Thu, Feb 11, 2010 at 11:51 AM, Sean Owen <[email protected]> wrote:
> On Thu, Feb 11, 2010 at 6:37 PM, Jake Mannix <[email protected]>
> wrote:
> > Why would the sparse representation be the only way to represent it
> > on disk? It's nearly twice as big as the dense form for dense vectors
> > (ok, 50% bigger).
>
> On disk (well, in any serialized form) you just have key-value,
> key-value pairs in sequence, right? Access time is irrelevant, so this
> representation is most space-efficient. Why's it bigger?
>

Because in a dense matrix or vector, you don't need the keys. Keeping
both keys and values makes the representation larger.

Moreover, specialized matrices can be kept more cheaply. Thus, a dense
integer matrix could use integer serialization with no keys resulting in
very small storage with systems like Avro. A sparse binary matrix can be
kept using just a delta encoded list of keys.

There is no single representation that is best, but Avro should make it
relatively easy to choose the on-disk representation at write-time.

In any case, statics are evil. Slightly less evil than true global
variables, but not all that far behind. If we can do without, we should.
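
Going back to the size argument, here is a rough sketch in Python that
makes the numbers concrete. It is not Mahout or Avro code: the function
names, the 4-byte int keys, the 8-byte double values, and the plain
unsigned base-128 varint (a simplification of Avro's zig-zag
variable-length integers) are all assumptions chosen for illustration.

```python
import struct


def encode_varint(n):
    """Encode a non-negative int as a base-128 varint (7 payload bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def serialize_dense(values):
    """Dense form: 4-byte length, then one 8-byte double per entry. No keys."""
    return struct.pack(">i", len(values)) + struct.pack(">%dd" % len(values), *values)


def serialize_sparse(values):
    """Key-value form: 4-byte count, then (4-byte index, 8-byte double) per non-zero."""
    pairs = [(i, v) for i, v in enumerate(values) if v != 0.0]
    out = bytearray(struct.pack(">i", len(pairs)))
    for i, v in pairs:
        out += struct.pack(">id", i, v)
    return bytes(out)


def serialize_binary_sparse(indices):
    """Sparse binary form: varint count, then delta-encoded sorted indices. No values."""
    out = bytearray(encode_varint(len(indices)))
    prev = 0
    for idx in sorted(indices):
        out += encode_varint(idx - prev)  # store the gap, not the absolute index
        prev = idx
    return bytes(out)


# A fully dense vector: every one of the 1000 entries is non-zero.
dense_vector = [float(i + 1) for i in range(1000)]
print(len(serialize_dense(dense_vector)))    # 8004 bytes
print(len(serialize_sparse(dense_vector)))   # 12004 bytes, about 50% bigger

# A sparse binary vector with five set positions.
print(len(serialize_binary_sparse([3, 10, 11, 500, 100000])))  # 9 bytes
```

With these assumed widths, each key-value pair costs 12 bytes against 8
for a bare value, which is where the "50% bigger" figure comes from for
a dense vector. The key-value form only wins once most entries are zero,
and the delta-encoded key list wins again when the values carry no
information.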
