[
https://issues.apache.org/jira/browse/MAHOUT-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672197#comment-13672197
]
Sean Owen commented on MAHOUT-1236:
-----------------------------------
Yes it's probably very similar. The comment was more about size being an
important concern here too. For example, simpler still is to use Java
serialization. But it would serialize the class name with every instance, for
example. For a billion small vectors that's huge overhead.
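
To put a rough number on that overhead, here is a minimal sketch. It assumes each record is serialized on its own, as it would be for independent records in a Hadoop file (within a single ObjectOutputStream the class descriptor is written once and then back-referenced). TinyVector is a hypothetical stand-in, not a Mahout class.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationOverhead {

  // Hypothetical stand-in for a small vector; not a Mahout class.
  static class TinyVector implements Serializable {
    private static final long serialVersionUID = 1L;
    final double[] values;

    TinyVector(double[] values) {
      this.values = values;
    }
  }

  // Java serialization with a fresh stream per record: the stream header and
  // the class descriptor (including the class name) are written every time.
  static int javaSerializedSize(TinyVector v) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(v);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.size();
  }

  // Hand-rolled format where reader and writer already agree on the type:
  // a 4-byte length followed by the raw 8-byte doubles.
  static int rawSize(TinyVector v) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(v.values.length);
      for (double d : v.values) {
        out.writeDouble(d);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.size();
  }

  public static void main(String[] args) {
    TinyVector v = new TinyVector(new double[] {1.0, 2.0, 3.0});
    System.out.println("Java serialization: " + javaSerializedSize(v) + " bytes");
    System.out.println("Raw DataOutput:     " + rawSize(v) + " bytes");
  }
}
```

The raw form is 4 + 3 * 8 = 28 bytes; the Java-serialized form carries the class metadata on top of the same payload.
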
That's no issue with these other options, where the reader/writer already know
the type and format anyway. The current 'format' is the ultimate in lean,
really. The size increase from a change to protobufs/Thrift/Avro would come
from having to represent optional fields with additional bytes some other way,
but that's still relatively minor. The big deal is representing integers
compactly, I think. I don't know Thrift/Avro but assume they probably have some
variable-length encoding too.
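
For reference, a variable-length integer encoding can be sketched as below. This is a generic base-128 scheme of the kind protobuf uses (7 payload bits per byte, high bit meaning "more bytes follow"); the class and method names are made up for illustration and are not taken from Mahout or any of these libraries.

```java
import java.io.ByteArrayOutputStream;

public class VarIntSketch {

  // Encodes an int using 7 bits per byte, low-order group first.
  // The high bit of each byte is set when more bytes follow.
  // Small non-negative values take 1 byte instead of a fixed 4.
  static byte[] encode(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
    return out.toByteArray();
  }

  // Reverses encode(): accumulates 7-bit groups until a byte
  // without the continuation bit is seen.
  static int decode(byte[] bytes) {
    int result = 0;
    int shift = 0;
    for (byte b : bytes) {
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        break;
      }
      shift += 7;
    }
    return result;
  }

  public static void main(String[] args) {
    int[] samples = {5, 300, 1_000_000, Integer.MAX_VALUE};
    for (int n : samples) {
      byte[] encoded = encode(n);
      System.out.println(n + " -> " + encoded.length + " byte(s), decoded " + decode(encoded));
    }
  }
}
```

Since vector indices are mostly small numbers, most of them fit in one or two bytes under a scheme like this.
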
FWIW I don't think it's necessarily useful to support N serialization
mechanisms, that's not what I was referring to.
But it's similar in the sense that the problem is that the serialized format
isn't "polymorphic". You have to write this generic all-encompassing format and
then have some object make (polymorphic, OOP) Java objects correctly from them.
That's what VectorWritable does. It's OK because with Hadoop we have to declare
the concrete type of the value upfront, and so we're always going to need this
level of indirection in order to fake polymorphism. That is, this lets you run
a job that consumes "VectorWritable" and actually send it sparse or dense
vectors.
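
To illustrate the indirection, here is a rough sketch of the idea: one declared wrapper format, a flag byte, and a reader that builds the right concrete object. It is not the actual VectorWritable code; the Vec, Dense and Sparse types, the FLAG_DENSE bit and the byte layout are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class VectorEnvelopeSketch {

  static final int FLAG_DENSE = 0x01; // hypothetical flag bit

  interface Vec {
    int size();
    double get(int index);
  }

  static final class Dense implements Vec {
    final double[] values;
    Dense(double[] values) { this.values = values; }
    public int size() { return values.length; }
    public double get(int index) { return values[index]; }
  }

  static final class Sparse implements Vec {
    final int size;
    final int[] indices;
    final double[] values;
    Sparse(int size, int[] indices, double[] values) {
      this.size = size;
      this.indices = indices;
      this.values = values;
    }
    public int size() { return size; }
    public double get(int index) {
      for (int i = 0; i < indices.length; i++) {
        if (indices[i] == index) return values[i];
      }
      return 0.0;
    }
  }

  // One non-polymorphic format: the flag byte records which layout follows.
  static void write(Vec v, DataOutput out) throws IOException {
    if (v instanceof Dense) {
      Dense d = (Dense) v;
      out.writeByte(FLAG_DENSE);
      out.writeInt(d.values.length);
      for (double x : d.values) out.writeDouble(x);
    } else {
      Sparse s = (Sparse) v;
      out.writeByte(0);
      out.writeInt(s.size);
      out.writeInt(s.indices.length);
      for (int i = 0; i < s.indices.length; i++) {
        out.writeInt(s.indices[i]);
        out.writeDouble(s.values[i]);
      }
    }
  }

  // The reader rebuilds the concrete (polymorphic) Java object from the flag.
  static Vec read(DataInput in) throws IOException {
    int flags = in.readByte();
    int size = in.readInt();
    if ((flags & FLAG_DENSE) != 0) {
      double[] values = new double[size];
      for (int i = 0; i < size; i++) values[i] = in.readDouble();
      return new Dense(values);
    }
    int nonZeros = in.readInt();
    int[] indices = new int[nonZeros];
    double[] values = new double[nonZeros];
    for (int i = 0; i < nonZeros; i++) {
      indices[i] = in.readInt();
      values[i] = in.readDouble();
    }
    return new Sparse(size, indices, values);
  }

  // Convenience helper: serialize to a byte array and read it back.
  static Vec roundTrip(Vec v) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      write(v, new DataOutputStream(bytes));
      return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Vec dense = roundTrip(new Dense(new double[] {1.0, 2.0}));
    Vec sparse = roundTrip(new Sparse(100, new int[] {7}, new double[] {2.5}));
    System.out.println(dense.getClass().getSimpleName() + " size " + dense.size());
    System.out.println(sparse.getClass().getSimpleName() + " size " + sparse.size());
  }
}
```

Every new decoration would need its own flag bit and its own branch in both write and read, which is where this approach gets unwieldy.
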
Now, vectors aren't really going to change. They're indices and numbers.
Decorators may change, and while decorators fit cleanly into OOP, they make
the mismatch above worse. Right now it works fine with the 'named' extension
(what doesn't work well there?). But if you want 10 other decorations to be
represented, it will be unwieldy. That's the motivation for wanting a different
format. But are there 10 other extensions that are really necessary?
How many times do you want to transparently handle either a plain Vector or
DecoratedVector? If you actually want and need to know the difference, then you
don't need to model this as a 'decoration' and don't have the problem above.
Names? OK I can see not caring about whether it's named. Weights? Yeah, maybe.
Centroids? What's special about centroids, for example?
Anyway I think that's the real question.
> Need a cleaned up serialized format for Vectors to handle names and all other
> kinds of things
> ---------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1236
> URL: https://issues.apache.org/jira/browse/MAHOUT-1236
> Project: Mahout
> Issue Type: Bug
> Reporter: Ted Dunning
>
> Our current serialization is subject to several ills
> a) it breaks alignment by having a 1 byte flag field (evil, generic)
> b) it doesn't handle any kind of extensible format like protobufs so it isn't
> future-proof
> c) it doesn't handle named vectors very well
> d) it totally breaks with any other kind of decoration as with Centroids or
> WeightedVector or ... (see b)
> I propose that we use the current tag byte on the current serialization with
> a new flag bit that indicates that the vector will use a protobuf encoding.
> Then 3 bytes will be skipped to restore alignment. Then there will be a
> protobuf encoding for the vector.