What's the specific improvement idea?

Size and speed improvements would be good. The Hadoop serialization
mechanism is already pretty low-level, dealing directly in bytes (as
opposed to fancier approaches like Avro). If anything it's fast and
lean, but quite manual. The latest Writable updates squeezed out most
of the remaining overhead.
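For context, here's a minimal sketch of what "manual" means, using a
made-up IndexValueWritable class; the real Mahout VectorWritable is more
involved, but the write/readFields contract is the same hands-on,
byte-level deal with no schema in between:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical Writable for a single (index, value) pair, just to show
// how hands-on the byte-level contract is: every field is written and
// read explicitly, in the same order, at a fixed width.
public class IndexValueWritable implements Writable {

  private int index;
  private double value;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(index);      // always 4 bytes
    out.writeDouble(value);   // always 8 bytes
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    index = in.readInt();
    value = in.readDouble();
  }
}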

One thing to keep in mind about the tradeoff between size and speed is
that a test against a local ramdisk makes the cost of reading/writing
bytes artificially low. That is to say, I'd err on the side of
compactness unless it makes a very big difference in decode time, since
I imagine the cost of decoding bytes is nothing compared to that of
storing the data and transmitting it over a network. (Not to mention
HDFS's work to replicate those bytes, etc.)

I suspect there might be some value in storing vector indices as
variable-length ints, since they're usually not that large. I can also
imagine more compact variable-length encodings than the one in
WritableUtils -- I'm thinking of the encoding used in MIDI (and
elsewhere, I'd guess), where 7 bits per byte carry data and the top bit
signals the final byte. IIRC, WritableUtils always spends 8 bits
writing the length of the encoding.
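For illustration, here's a rough sketch of that MIDI-style scheme as I
understand it, with hypothetical VarInt7 write/read helpers. This is
not what WritableUtils does today, just the alternative I have in mind:
small indices cost a single byte, and the continuation bit replaces any
explicit length byte.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical 7-bits-per-byte varint: the high bit of each byte says
// whether more bytes follow, so the length is implicit in the data.
final class VarInt7 {

  static void write(DataOutput out, int value) throws IOException {
    // Emit 7 bits at a time, least-significant group first; set the
    // high bit on every byte except the last.
    while ((value & ~0x7F) != 0) {
      out.writeByte((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.writeByte(value);
  }

  static int read(DataInput in) throws IOException {
    int result = 0;
    int shift = 0;
    while (true) {
      int b = in.readByte() & 0xFF;
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return result;   // high bit clear: this was the final byte
      }
      shift += 7;
    }
  }
}

An index under 128 takes one byte, under 16384 two bytes, and so on,
which seems like a good fit for sparse vector indices.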

On Sun, May 2, 2010 at 5:02 PM, Robin Anil <robin.a...@gmail.com> wrote:
> I am getting more and more ideas as I try to write about scaling Mahout
> clustering. I added serialize and deserialize benchmarks for Vectors and
> checked the speed of our vectors.
>
> Here is the output with Cardinality=1000 Sparsity=1000(dense) numVectors=100
> loop=100 (hence writing 10K(int-doubles) to and reading back from disk)
> Note: that these are not disk MB/s but the size of vectors/per sec
> deserialized and the filesystem is a Ramdisk.
