So a vector is simply either an array of doubles or an array of (int:double) pairs, and that choice, along with any other metadata, is stored in a MetadataWritable. Makes sense to me, assuming MetadataWritable lets us skip over a vector efficiently without deserializing it.
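To make the skipping point concrete, here is a rough sketch of one layout that would allow it. Everything in it is an assumption made up for this mail: the class VectorLayoutSketch, its method names, and the flag-then-length header are hypothetical and are not the existing VectorWritable format or any actual MetadataWritable. It uses only java.io so it can be tried without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical layout, not Mahout code:
//   [boolean dense][int payloadBytes][payload]
class VectorLayoutSketch {

  // Dense payload: element count, then that many doubles.
  static void writeDense(DataOutput out, double[] values) throws IOException {
    out.writeBoolean(true);
    out.writeInt(4 + 8 * values.length); // payload size in bytes
    out.writeInt(values.length);
    for (double v : values) {
      out.writeDouble(v);
    }
  }

  // Sparse payload: entry count, then (int index, double value) pairs.
  static void writeSparse(DataOutput out, int[] indices, double[] values)
      throws IOException {
    out.writeBoolean(false);
    out.writeInt(4 + 12 * indices.length); // payload size in bytes
    out.writeInt(indices.length);
    for (int i = 0; i < indices.length; i++) {
      out.writeInt(indices[i]);
      out.writeDouble(values[i]);
    }
  }

  // Steps over one vector using only the header; returns true if it was dense.
  static boolean skip(DataInput in) throws IOException {
    boolean dense = in.readBoolean();
    int remaining = in.readInt();
    while (remaining > 0) {
      int n = in.skipBytes(remaining); // may skip fewer bytes than requested
      if (n <= 0) {
        throw new EOFException("truncated vector payload");
      }
      remaining -= n;
    }
    return dense;
  }

  // Fully decodes a dense vector; fails if the flag says sparse.
  static double[] readDense(DataInput in) throws IOException {
    if (!in.readBoolean()) {
      throw new IOException("expected a dense vector");
    }
    in.readInt(); // payload size, not needed when decoding everything
    double[] values = new double[in.readInt()];
    for (int i = 0; i < values.length; i++) {
      values[i] = in.readDouble();
    }
    return values;
  }

  // Writes a sparse vector then a dense one, skips the first, decodes the second.
  static double[] skipSparseThenReadDense() {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      writeSparse(out, new int[] {3, 7}, new double[] {1.5, 2.5});
      writeDense(out, new double[] {1.0, 2.0, 3.0});
      DataInput in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
      if (skip(in)) {
        throw new IllegalStateException("first vector should be sparse");
      }
      return readDense(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The price of being able to skip in this sketch is the extra 4-byte length per vector; whether that is acceptable is part of what we would need to decide.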
On Sun, Apr 25, 2010 at 8:58 PM, Sean Owen <[email protected]> wrote:
> Yes, I think if we can convince ourselves that there won't be that
> many different possibilities for representing a vector, then a simple
> boolean might unify everything. This approach doesn't 'scale' but I
> don't know there are other representations we must have.
>
> The issue of named vectors is interesting. There's not really such a
> thing as an optional field in Hadoop serialization. You can fake it
> with a boolean but that starts to be messy.
>
> Messy might be necessary as vectors perhaps take on more metadata --
> though I can't envision much more. So perhaps it is right and proper
> to retain a second serialization format, in NamedVectorWritable, which
> is really the "vector with metadata" serializer versus
> VectorWritable's "pure vector" serializer.
>
> It has a logic to me. It gets rid of writing the class name which is
> indeed unpalatable.
>
> Thoughts before I go tearing through again?
>

Let more comments come in before tearing it down. This affects everything. We *have to* get it right by the next release, not necessarily today or tomorrow. Otherwise it would more or less kill things for all the 0.3 users. Once it is fixed, we can provide a converter to convert existing data to the new representation.

Robin
