On Sun, May 2, 2010 at 9:40 PM, Sean Owen <sro...@gmail.com> wrote:

> What's the specific improvement idea?
>
> Size and speed improvements would be good. The Hadoop serialization
> mechanism is already pretty low-level, dealing directly in bytes (as
> opposed to fancier stuff like Avro). If anything it's fast and lean,
> but quite manual. The latest Writable updates squeezed out most of the
> remaining overhead.
>
> One thing to recall is that in the tradeoff between size and speed, a
> test against a local ramdisk will make the cost of reading/writing
> bytes artificially low. That is to say I'd just err more on the side
> of compactness unless it makes a very big difference in decode time,
> as I imagine the cost of decoding bytes is nothing compared to that of
> storing and transmitting over a network. (Not to mention HDFS's work
> to replicate those bytes, etc.)
>
> I suspect there might be some value in storing vector indices as
> variable length ints, since they're usually not so large. I can also
> imagine more compact variable length encodings than the one in
> WritableUtils -- thinking of the encoding used in MIDI (and elsewhere
> I'd guess), where 7 bits per byte are used and the top bit flags the
> final byte. IIRC WritableUtils always spends 8 bits writing the
> length of the encoding.
>
You mean this type of encoding instead?
 http://code.google.com/apis/protocolbuffers/docs/encoding.html
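
For reference, here is a minimal sketch of that 7-bits-per-byte scheme (a
hypothetical helper in the protobuf/MIDI style, not the actual WritableUtils
or protobuf code): the low 7 bits of each byte carry payload and the high bit
flags whether another byte follows, so small indices cost a single byte.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical helper, shown only to illustrate the encoding being discussed.
public final class VarIntCodec {

  private VarIntCodec() {}

  // Writes a non-negative int in 1-5 bytes; the high bit of each byte
  // signals that another byte follows.
  public static void writeUnsignedVarInt(int value, DataOutput out) throws IOException {
    while ((value & ~0x7F) != 0) {
      out.writeByte((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.writeByte(value);
  }

  // Reads an int written by writeUnsignedVarInt.
  public static int readUnsignedVarInt(DataInput in) throws IOException {
    int value = 0;
    int shift = 0;
    int b;
    do {
      b = in.readByte() & 0xFF;
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }
}

With this, an index like 300 takes two bytes (0xAC 0x02) instead of four,
which is where the size win for typical vector indices would come from.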

>
> On Sun, May 2, 2010 at 5:02 PM, Robin Anil <robin.a...@gmail.com> wrote:
> > I am getting more and more ideas as I try to write about scaling Mahout
> > clustering. I added a serialize and deserialize benchmark for Vectors and
> > checked the speed of our vectors.
> >
> > Here is the output with Cardinality=1000, Sparsity=1000 (dense),
> > numVectors=100, and loop=100, hence writing 10K (int-doubles) to and
> > reading them back from disk. Note that these are not disk MB/s but the
> > number of vectors deserialized per second, and the filesystem is a ramdisk.
>
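
For anyone curious about the shape of that benchmark, it is roughly the
following (a simplified sketch with made-up class names and a placeholder
ramdisk path, not the actual benchmark code): write each vector's int-double
pairs to a file on the ramdisk loop times, read them all back, and report
vectors per second in each direction.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Simplified sketch only; names and the ramdisk path are placeholders.
public class VectorSerdeBenchmarkSketch {

  public static void main(String[] args) throws IOException {
    int cardinality = 1000;  // Cardinality=1000 (dense, so Sparsity=1000)
    int numVectors = 100;    // numVectors=100
    int loop = 100;          // loop=100
    File file = new File("/mnt/ramdisk/vectors.bin");  // placeholder ramdisk path

    double[][] vectors = new double[numVectors][cardinality];  // stand-in dense vectors

    long start = System.nanoTime();
    try (DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream(file)))) {
      for (int l = 0; l < loop; l++) {
        for (double[] v : vectors) {
          for (int i = 0; i < cardinality; i++) {
            out.writeInt(i);        // index
            out.writeDouble(v[i]);  // value
          }
        }
      }
    }
    long writeNanos = System.nanoTime() - start;

    start = System.nanoTime();
    try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream(file)))) {
      for (int l = 0; l < loop; l++) {
        for (int v = 0; v < numVectors; v++) {
          for (int i = 0; i < cardinality; i++) {
            in.readInt();
            in.readDouble();
          }
        }
      }
    }
    long readNanos = System.nanoTime() - start;

    double totalVectors = (double) numVectors * loop;
    System.out.printf("serialize:   %.0f vectors/sec%n", totalVectors / (writeNanos / 1e9));
    System.out.printf("deserialize: %.0f vectors/sec%n", totalVectors / (readNanos / 1e9));
  }
}

The vectors/sec figures here are the same kind of number described above:
a rate of vectors handled per second, not disk MB/s.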
