[ https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672192#comment-13672192 ]
Jake Mannix commented on MAHOUT-1236:
-------------------------------------

Thrift leaves off optional fields pretty well too, right? I've never seen much difference in the sizes of the thrifts vs. protobufs vs. raw writables here at Twitter (we've got some pretty heterogeneous sources).

What do you mean about a "VectorWritable" factory thing to work with Hadoop? You mean something like ProtobufWritable<ProtoVector> or ThriftWritable<ThriftVector> (where ProtoVector extends Message, and ThriftVector extends TBase)?

ElephantBird has some good utilities for this kind of thing, e.g.
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/ProtobufWritable.java
and
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/ThriftWritable.java

> Need a cleaned up serialized format for Vectors to handle names and all other kinds of things
> ---------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1236
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1236
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Ted Dunning
>
> Our current serialization is subject to several ills:
> a) it breaks alignment by having a 1 byte flag field (evil, generic)
> b) it doesn't handle any kind of extensible format like protobufs, so it isn't future-proof
> c) it doesn't handle named vectors very well
> d) it totally breaks with any other kind of decoration, as with Centroids or WeightedVector or ... (see b)
> I propose that we use the current tag byte on the current serialization with a new flag bit that indicates that the vector will use a protobuf encoding. Then 3 bytes will be skipped to restore alignment. Then there will be a protobuf encoding for the vector.
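
For readers following the proposal quoted above, here is a minimal Java sketch of the framing it describes: a tag byte with a new flag bit, three skipped bytes to restore alignment, then the protobuf-encoded vector. It is not Mahout code. The class name, the value of FLAG_PROTOBUF, and the 4-byte length prefix are all assumptions made for illustration; the issue does not say which bit would be used or how the end of the payload would be found. The protobuf bytes are treated as an opaque array so the example needs no protobuf dependency.

```java
import java.nio.ByteBuffer;

/** Sketch of the framing proposed in MAHOUT-1236. All names and constants are hypothetical. */
final class FramedVectorSketch {
    // Hypothetical flag bit: the issue does not specify which bit of the tag byte is used.
    static final int FLAG_PROTOBUF = 0x80;
    // Tag byte plus 3 skipped bytes gives a 4-byte aligned header.
    static final int HEADER_BYTES = 4;

    /** Frames an already-encoded protobuf payload behind the tag byte and padding. */
    static byte[] frame(int existingFlags, byte[] protobufPayload) {
        ByteBuffer buf =
            ByteBuffer.allocate(HEADER_BYTES + Integer.BYTES + protobufPayload.length);
        buf.put((byte) (existingFlags | FLAG_PROTOBUF)); // tag byte with the new bit set
        buf.put(new byte[3]);                            // 3 skipped bytes restore alignment
        buf.putInt(protobufPayload.length);              // length prefix: assumption of this sketch
        buf.put(protobufPayload);                        // opaque protobuf-encoded vector
        return buf.array();
    }

    /** Returns true if the tag byte marks the record as protobuf-encoded. */
    static boolean isProtobuf(byte[] framed) {
        return (framed[0] & FLAG_PROTOBUF) != 0;
    }

    /** Extracts the protobuf payload from a framed record. */
    static byte[] payload(byte[] framed) {
        if (!isProtobuf(framed)) {
            throw new IllegalArgumentException("record does not use the protobuf encoding");
        }
        ByteBuffer buf = ByteBuffer.wrap(framed);
        buf.position(HEADER_BYTES);        // skip tag byte and padding
        byte[] out = new byte[buf.getInt()];
        buf.get(out);
        return out;
    }
}
```

A reader that sees the flag bit unset could fall back to the existing decoding path, which is what would let old and new records coexist in the same files.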