On Sun, May 2, 2010 at 9:40 PM, Sean Owen <sro...@gmail.com> wrote:

> What's the specific improvement idea?
>
> Size and speed improvements would be good. The Hadoop serialization
> mechanism is already pretty low-level, dealing directly in bytes (as
> opposed to fancier stuff like Avro). It's if anything fast and lean
> but quite manual. The latest Writable updates squeezed out most of the
> remaining overhead.
>
> One thing to recall is that in the tradeoff between size and speed, a
> test against a local ramdisk will make the cost of reading/writing
> bytes artificially low. That is to say I'd just err more on the side
> of compactness unless it makes a very big difference in decode time,
> as I imagine the cost of decoding bytes is nothing compared to that of
> storing and transmitting over a network. (Not to mention HDFS's work
> to replicate those bytes, etc.)
>
> I suspect there might be some value in storing vector indices as
> variable length ints, since they're usually not so large. I can also
> imagine more compact variable length encodings than the one in
> WritableUtils -- thinking of the encoding used in MIDI (and elsewhere
> I'd guess), where 7 bits per byte are used and the top bit signals the
> final value. IIRC WritableUtils always spends 8 bits writing the
> length of the encoding.

You mean this type of encoding instead?
http://code.google.com/apis/protocolbuffers/docs/encoding.html
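
For reference, the scheme described above (and at the protobuf link) is a base-128
varint: the low 7 bits of each byte carry data and the high bit says whether more
bytes follow. A minimal illustrative sketch in Java -- not the actual WritableUtils
or protobuf code; the class and method names here are made up:

    import java.io.ByteArrayOutputStream;

    // Illustrative base-128 varint codec: low 7 bits per byte, high bit set
    // means "more bytes follow" (least-significant group first, as in protobuf).
    public final class VarInt {

      // Writes a non-negative int using 1-5 bytes.
      static void writeUnsignedVarInt(int value, ByteArrayOutputStream out) {
        while ((value & ~0x7F) != 0) {
          out.write((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
          value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
      }

      // Reads a value written by writeUnsignedVarInt, starting at offset.
      static int readUnsignedVarInt(byte[] buf, int offset) {
        int result = 0;
        int shift = 0;
        int b;
        do {
          b = buf[offset++] & 0xFF;
          result |= (b & 0x7F) << shift;
          shift += 7;
        } while ((b & 0x80) != 0);
        return result;
      }

      public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeUnsignedVarInt(300, out);          // encodes as 0xAC 0x02 (2 bytes)
        writeUnsignedVarInt(5, out);            // encodes as 0x05 (1 byte)
        byte[] bytes = out.toByteArray();
        System.out.println(readUnsignedVarInt(bytes, 0)); // 300
        System.out.println(readUnsignedVarInt(bytes, 2)); // 5
      }
    }

Small indices cost one byte instead of four, which is where the size win for
sparse vector indices would come from.
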
> On Sun, May 2, 2010 at 5:02 PM, Robin Anil <robin.a...@gmail.com> wrote:
> > I am getting more and more ideas as I try to write about scaling Mahout
> > clustering. I added serialize and deserialize benchmarks for Vectors and
> > checked the speed of our vectors.
> >
> > Here is the output with Cardinality=1000 Sparsity=1000(dense)
> > numVectors=100 loop=100 (hence writing 10K (int-doubles) to and reading
> > back from disk).
> > Note that these are not disk MB/s but the number of vectors per second
> > deserialized, and the filesystem is a ramdisk.
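
A benchmark along those lines can be sketched as follows. This is an illustrative
reconstruction, not the actual benchmark code: it assumes Mahout's VectorWritable
and DenseVector from org.apache.mahout.math, and times Writable
serialization/deserialization against an in-memory buffer to approximate a ramdisk.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Sketch: serialize numVectors dense vectors to a buffer and read them
    // back, loop times, reporting vectors/sec in each direction.
    public class VectorSerializationBench {

      public static void main(String[] args) throws IOException {
        int cardinality = 1000;
        int numVectors = 100;
        int loop = 100;

        Vector v = new DenseVector(cardinality);
        for (int i = 0; i < cardinality; i++) {
          v.set(i, Math.random());               // fully dense vector
        }
        VectorWritable writable = new VectorWritable(v);

        long writeNanos = 0;
        long readNanos = 0;
        for (int l = 0; l < loop; l++) {
          ByteArrayOutputStream bytes = new ByteArrayOutputStream();
          DataOutputStream out = new DataOutputStream(bytes);
          long t0 = System.nanoTime();
          for (int n = 0; n < numVectors; n++) {
            writable.write(out);                 // Writable serialization
          }
          out.flush();
          writeNanos += System.nanoTime() - t0;

          DataInputStream in =
              new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
          VectorWritable read = new VectorWritable();
          long t1 = System.nanoTime();
          for (int n = 0; n < numVectors; n++) {
            read.readFields(in);                 // Writable deserialization
          }
          readNanos += System.nanoTime() - t1;
        }

        double total = numVectors * (double) loop;
        System.out.printf("serialize:   %.0f vectors/sec%n", total / (writeNanos / 1e9));
        System.out.printf("deserialize: %.0f vectors/sec%n", total / (readNanos / 1e9));
      }
    }

As noted above, an in-memory (or ramdisk) target makes the raw byte I/O nearly
free, so numbers from this kind of loop mostly measure encode/decode CPU cost.
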