A Vector is simply either an array of doubles or an array of (int:double)
pairs, and this information, along with any other metadata, is stored in a
MetadataWritable. Makes sense to me, assuming MetadataWritable lets us skip
over it efficiently without deserializing.
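
A rough sketch of that layout, to make the idea concrete. This is only an
illustration under assumptions: the class and method names are hypothetical,
the metadata is treated as an opaque length-prefixed byte array, and it uses
plain java.io streams rather than Hadoop's Writable interface or any actual
Mahout class. The length prefix is what lets a reader skip the metadata
without deserializing it, and a single boolean picks dense vs. sparse.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.EOFException;
import java.io.IOException;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical record layout (not Mahout's actual format):
//   [int metadataLength][metadata bytes][boolean dense][int count][payload]
final class VectorCodecSketch {

    private static void writeMetadata(DataOutput out, byte[] metadata) throws IOException {
        out.writeInt(metadata.length); // length prefix makes the block skippable
        out.write(metadata);
    }

    // Dense form: just the values, indices are implied by position.
    static void writeDense(DataOutput out, byte[] metadata, double[] values) throws IOException {
        writeMetadata(out, metadata);
        out.writeBoolean(true);
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
    }

    // Sparse form: explicit (int:double) pairs.
    static void writeSparse(DataOutput out, byte[] metadata, SortedMap<Integer, Double> entries)
            throws IOException {
        writeMetadata(out, metadata);
        out.writeBoolean(false);
        out.writeInt(entries.size());
        for (Map.Entry<Integer, Double> e : entries.entrySet()) {
            out.writeInt(e.getKey());
            out.writeDouble(e.getValue());
        }
    }

    // Reads the vector payload while skipping the metadata block unparsed.
    static SortedMap<Integer, Double> readSkippingMetadata(DataInput in) throws IOException {
        int metadataLength = in.readInt();
        int skipped = 0;
        while (skipped < metadataLength) {
            // skipBytes may skip fewer bytes than requested, so loop.
            int n = in.skipBytes(metadataLength - skipped);
            if (n <= 0) {
                throw new EOFException("stream ended inside metadata block");
            }
            skipped += n;
        }
        boolean dense = in.readBoolean();
        int count = in.readInt();
        SortedMap<Integer, Double> result = new TreeMap<>();
        for (int i = 0; i < count; i++) {
            int index = dense ? i : in.readInt();
            result.put(index, in.readDouble());
        }
        return result;
    }
}
```

With something like this, an optional name would simply live inside the
metadata bytes, so the "pure vector" reader never has to know about it.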


On Sun, Apr 25, 2010 at 8:58 PM, Sean Owen <sro...@gmail.com> wrote:

> Yes, I think if we can convince ourselves that there won't be that
> many different possibilities for representing a vector, then a simple
> boolean might unify everything. This approach doesn't 'scale' but I
> don't know there are other representations we must have.
>
> The issue of named vectors is interesting. There's not really such a
> thing as an optional field in Hadoop serialization. You can fake it
> with a boolean but that starts to be messy.
>
> Messy might be necessary as vectors perhaps take on more metadata --
> though I can't envision much more. So perhaps it is right and proper
> to retain a second serialization format, in NamedVectorWritable, which
> is really the "vector with metadata" serializer versus
> VectorWritable's "pure vector" serializer.
>
> It has a logic to me. It gets rid of writing the class name which is
> indeed unpalatable.
>
> Thoughts before I go tearing through again?
>
Let more comments come in before tearing it down. This affects everything.
We *have to* get it right by the next release, not necessarily today or
tomorrow. Otherwise we would pretty much kill things for all the 0.3 users.
Once it is fixed, we can provide a converter to migrate to the new
representation.

Robin
