[ 
https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672192#comment-13672192
 ] 

Jake Mannix commented on MAHOUT-1236:
-------------------------------------

Thrift leaves off optional fields pretty well too, right?  I've never seen much 
difference in size between Thrift, protobufs, and raw Writables here at 
Twitter (we've got some pretty heterogeneous sources).  

What do you mean by a "VectorWritable" factory thing to work with Hadoop?  
Do you mean something like ProtobufWritable<ProtoVector> or 
ThriftWritable<ThriftVector> (where ProtoVector extends Message and 
ThriftVector extends TBase)?  ElephantBird has some good utilities for this 
kind of thing, e.g. 
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/ProtobufWritable.java
 and 
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/ThriftWritable.java
                
> Need a cleaned up serialized format for Vectors to handle names and all other 
> kinds of things
> ---------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1236
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1236
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Ted Dunning
>
> Our current serialization is subject to several ills:
> a) it breaks alignment by having a 1 byte flag field (evil, generic)
> b) it doesn't handle any kind of extensible format like protobufs so it isn't 
> future-proof
> c) it doesn't handle named vectors very well
> d) it totally breaks with any other kind of decoration as with Centroids or 
> WeightedVector or ... (see b)
> I propose that we use the current tag byte on the current serialization with 
> a new flag bit that indicates that the vector will use a protobuf encoding.  
> Then 3 bytes will be skipped to restore alignment.  Then there will be a 
> protobuf encoding for the vector. 
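
The framing proposed in the quoted description could be sketched roughly as 
below. This is only an illustration under stated assumptions: the class name 
VectorFraming, the flag value 0x80, and the method names are hypothetical, 
since the issue does not say which bit of the tag byte is free, and the 
protobuf payload is treated as opaque bytes rather than a real Mahout message.

```java
import java.util.Arrays;

public class VectorFraming {
  // Hypothetical flag bit; the issue does not specify which bit would be used.
  static final int FLAG_PROTOBUF = 0x80;
  // One tag byte plus three skipped bytes, so the payload starts at offset 4.
  static final int HEADER_BYTES = 4;

  // Prefix an already-encoded protobuf payload with the tag byte and padding.
  static byte[] frame(int existingTag, byte[] protobufPayload) {
    byte[] framed = new byte[HEADER_BYTES + protobufPayload.length];
    framed[0] = (byte) (existingTag | FLAG_PROTOBUF);
    // framed[1..3] stay zero: these are the three skipped alignment bytes.
    System.arraycopy(protobufPayload, 0, framed, HEADER_BYTES, protobufPayload.length);
    return framed;
  }

  // A reader checks the flag bit before choosing a decoding path.
  static boolean usesProtobuf(byte[] framed) {
    return (framed[0] & FLAG_PROTOBUF) != 0;
  }

  // Strip the 4-byte header and return the bytes to hand to a protobuf parser.
  static byte[] payload(byte[] framed) {
    return Arrays.copyOfRange(framed, HEADER_BYTES, framed.length);
  }

  public static void main(String[] args) {
    byte[] standIn = {0x08, 0x2A}; // stand-in bytes, not a real vector message
    byte[] framed = frame(0x01, standIn);
    System.out.println(usesProtobuf(framed) + " " + Arrays.toString(payload(framed)));
  }
}
```

Readers that do not see the flag bit set would fall through to the existing 
decoding path, which is what keeps old data readable in this scheme.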

