[ 
https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672197#comment-13672197
 ] 

Sean Owen commented on MAHOUT-1236:
-----------------------------------

Yes it's probably very similar. The comment was more about size being an 
important concern here too. For example, simpler still is to use Java 
serialization, but it would serialize the class name with every instance. 
For a billion small vectors that's huge overhead.
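
To put a rough number on that overhead, here is a minimal sketch. It assumes 
each record goes through its own ObjectOutputStream, as it would if every 
value were serialized independently. TinyVector is a hypothetical stand-in, 
not a Mahout class, and the exact Java-serialized size depends on the class 
and field names, so only the comparison matters.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializationOverhead {

    // Hypothetical stand-in for a small vector; not a Mahout class.
    static final class TinyVector implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] values;

        TinyVector(double[] values) {
            this.values = values;
        }
    }

    // Size when the record is written as its own Java-serialized object,
    // which carries a stream header and class descriptors (including names).
    static int javaSerializedSize(TinyVector v) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(v);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Size when reader and writer already agree on the layout:
    // a 4-byte length followed by 8 bytes per value.
    static int rawSize(TinyVector v) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(v.values.length);
            for (double d : v.values) {
                out.writeDouble(d);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        TinyVector v = new TinyVector(new double[] {1.0, 2.0, 3.0});
        System.out.println("Java serialization: " + javaSerializedSize(v) + " bytes");
        System.out.println("Raw layout:         " + rawSize(v) + " bytes"); // 4 + 3 * 8 = 28
    }
}
```

Within a single ObjectOutputStream the class descriptor is written once and 
then back-referenced, so the per-instance cost applies when records are 
serialized separately.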

That's no issue with these other options, where the reader/writer already know 
the type and format anyway. The current 'format' is the ultimate in lean, 
really. The size increase from switching to protobufs/Thrift/Avro would come 
from having to represent optional fields with additional bytes some other way, 
but that's still relatively minor. The big deal is representing integers 
compactly, I think. I don't know Thrift/Avro but assume they probably have some 
variable-length encoding too.
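
For reference, a variable-length integer encoding can be sketched as below. 
This is the base-128 style that protobuf's wire format uses for its varints: 
7 payload bits per byte, with the high bit marking that more bytes follow. 
The class and method names are hypothetical and this is not taken from any 
of the libraries mentioned; it only handles non-negative ints.

```java
import java.io.ByteArrayOutputStream;

final class VarInt {

    // Encodes a non-negative int using 7 bits per byte, low-order group first.
    // The high bit of each byte is set when another byte follows.
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values not supported in this sketch");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decodes a value produced by encode().
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated varint");
    }

    public static void main(String[] args) {
        for (int n : new int[] {5, 300, 1_000_000, Integer.MAX_VALUE}) {
            System.out.println(n + " -> " + encode(n).length + " byte(s)");
        }
    }
}
```

Small indices, which dominate in sparse vectors, take one or two bytes 
instead of a fixed four.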

FWIW I don't think it's necessarily useful to support N serialization 
mechanisms, that's not what I was referring to.
But it's similar in the sense that the problem is that the serialized format 
isn't "polymorphic". You have to write this generic all-encompassing format and 
then have some object make (polymorphic, OOP) Java objects correctly from them. 
That's what VectorWritable does. It's OK because with Hadoop we have to declare 
the concrete type of the value upfront, and so we're always going to need this 
level of indirection in order to fake polymorphism. That is, this lets you run 
a job that consumes "VectorWritable" and actually send it sparse or dense 
vectors.
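
A rough sketch of that indirection, to make it concrete: one declared 
wrapper format with a leading flag byte, and a reader that builds the right 
concrete object from it. Every name here (TaggedVectorIO, Vec, DenseVec, 
SparseVec, FLAG_DENSE and its value) is hypothetical. This is not Mahout's 
VectorWritable and does not reproduce its actual byte layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// All names are hypothetical; this only illustrates the flag-byte dispatch.
final class TaggedVectorIO {
    static final int FLAG_DENSE = 0x01; // made-up flag value for this sketch

    interface Vec {
        int size();
        double get(int index);
    }

    static final class DenseVec implements Vec {
        final double[] values;
        DenseVec(double[] values) { this.values = values; }
        public int size() { return values.length; }
        public double get(int index) { return values[index]; }
    }

    static final class SparseVec implements Vec {
        final int size;
        final Map<Integer, Double> entries = new HashMap<>();
        SparseVec(int size) { this.size = size; }
        public int size() { return size; }
        public double get(int index) { return entries.getOrDefault(index, 0.0); }
    }

    // One generic format: flag byte, logical size, then a type-specific body.
    static byte[] write(Vec v) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            if (v instanceof DenseVec) {
                out.writeByte(FLAG_DENSE);
                out.writeInt(v.size());
                for (double d : ((DenseVec) v).values) {
                    out.writeDouble(d);
                }
            } else {
                SparseVec s = (SparseVec) v; // the sketch knows only two types
                out.writeByte(0);
                out.writeInt(s.size);
                out.writeInt(s.entries.size());
                for (Map.Entry<Integer, Double> e : s.entries.entrySet()) {
                    out.writeInt(e.getKey());
                    out.writeDouble(e.getValue());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // The reader inspects the flag and constructs the matching concrete type.
    static Vec read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int flags = in.readByte();
            int size = in.readInt();
            if ((flags & FLAG_DENSE) != 0) {
                double[] values = new double[size];
                for (int i = 0; i < size; i++) {
                    values[i] = in.readDouble();
                }
                return new DenseVec(values);
            }
            SparseVec s = new SparseVec(size);
            int nonZeros = in.readInt();
            for (int i = 0; i < nonZeros; i++) {
                int index = in.readInt();
                double value = in.readDouble();
                s.entries.put(index, value);
            }
            return s;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The consumer only ever declares the wrapper; each new decoration would need 
another flag and another branch in both methods, which is where it gets 
unwieldy.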

Now, vectors aren't really going to change. They're indices and numbers. 
Decorators may change, and while decorators fit cleanly into OOP, they make 
the mismatch above worse. Right now it works fine with the 'named' extension 
(what doesn't work well there?). But if you want 10 other decorations to be 
represented, it will be unwieldy. That's the motivation for wanting a different 
format. But are there 10 other extensions that are really necessary?

How many times do you want to transparently handle either a plain Vector or 
DecoratedVector? If you actually want and need to know the difference, then you 
don't need to model this as a 'decoration' and don't have the problem above. 
Names? OK I can see not caring about whether it's named. Weights? yeah maybe. 
Centroids? what's special about centroids for example?

Anyway I think that's the real question. 


                
> Need a cleaned up serialized format for Vectors to handle names and all other 
> kinds of things
> ---------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1236
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1236
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Ted Dunning
>
> Our current serialization is subject to several ills
> a) it breaks alignment by having a 1 byte flag field (evil, generic)
> b) it doesn't handle any kind of extensible format like protobufs so it isn't 
> future-proof
> c) it doesn't handle named vectors very well
> d) it totally breaks with any other kind of decoration as with Centroids or 
> WeightedVector or ... (see b)
> I propose that we use the current tag byte on the current serialization with 
> a new flag bit that indicates that the vector will use a protobuf encoding.  
> Then 3 bytes will be skipped to restore alignment.  Then there will be a 
> protobuf encoding for the vector. 

