Well that goes down another interesting road. I think we have all
enjoyed the idea of keeping "decoration" out of the core Vector
implementation. Vector and its subclasses differ only in how they
store elements and values.

Notions like name (and ideally, labels) are farmed out to a decorator
like NamedVector.

This is all wonderful in the object-oriented world. The language and
object layout in memory are most happy for you to treat a NamedVector
as just a Vector; the extra data in memory is irrelevant.

It gets tricky when trying to write Writables for all this, since when
reading a sequence of objects from a stream you can't somehow know a
priori that there's more data in there than you expect and what to
ignore, and how. You don't have a parallel hierarchy of Writables --
it doesn't work that way. Instead you need one factory
(VectorWritable) that has knowledge of the serialized form of all
these things.
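To make the "one factory" idea concrete, here is a rough sketch. Everything in it is hypothetical (the Vec, DenseVec and SparseVec stand-ins, the type tags, the byte layout); it is not Mahout's actual wire format. It only shows a single writer/reader pair that knows every serialized form and dispatches on a leading tag.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins for illustration; not Mahout's classes or format.
class SingleFactorySketch {
    static final byte DENSE = 0;   // assumed type tags
    static final byte SPARSE = 1;

    interface Vec { int size(); double get(int index); }

    static final class DenseVec implements Vec {
        final double[] values;
        DenseVec(double... values) { this.values = values; }
        public int size() { return values.length; }
        public double get(int index) { return values[index]; }
    }

    static final class SparseVec implements Vec {
        final int size;
        final TreeMap<Integer, Double> nonZeros = new TreeMap<>();
        SparseVec(int size) { this.size = size; }
        public int size() { return size; }
        public double get(int index) { return nonZeros.getOrDefault(index, 0.0); }
    }

    // The one place that knows how every representation is laid out.
    static void write(DataOutput out, Vec v) throws IOException {
        if (v instanceof DenseVec) {
            out.writeByte(DENSE);
            out.writeInt(v.size());
            for (int i = 0; i < v.size(); i++) out.writeDouble(v.get(i));
        } else if (v instanceof SparseVec) {
            SparseVec s = (SparseVec) v;
            out.writeByte(SPARSE);
            out.writeInt(s.size);
            out.writeInt(s.nonZeros.size());
            for (Map.Entry<Integer, Double> e : s.nonZeros.entrySet()) {
                out.writeInt(e.getKey());
                out.writeDouble(e.getValue());
            }
        } else {
            throw new IOException("unknown vector type: " + v.getClass());
        }
    }

    // The tag tells the reader which layout follows.
    static Vec read(DataInput in) throws IOException {
        byte tag = in.readByte();
        int size = in.readInt();
        if (tag == DENSE) {
            double[] values = new double[size];
            for (int i = 0; i < size; i++) values[i] = in.readDouble();
            return new DenseVec(values);
        } else if (tag == SPARSE) {
            SparseVec s = new SparseVec(size);
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                int index = in.readInt();
                s.nonZeros.put(index, in.readDouble());
            }
            return s;
        }
        throw new IOException("unknown type tag: " + tag);
    }

    // Serialize to memory and read back, for demonstration.
    static Vec roundTrip(Vec v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            write(new DataOutputStream(bytes), v);
            return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SparseVec sparse = new SparseVec(5);
        sparse.nonZeros.put(3, 4.5);
        Vec back = roundTrip(sparse);
        System.out.println(back.getClass().getSimpleName() + " size=" + back.size());
    }
}
```

The cost is visible in write() and read(): every new representation means touching this one class, which is the trade against a parallel hierarchy.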

(Well, we did initially just serialize with each Vector the name of
its corresponding Writable. This is a tidy solution indeed, but is a
lot of overhead. So that went away.)


Back to Robin's point:
If there is a need for such a thing as a "weighted vector", then I
suggest that instead of injecting a field in Vector, it become another
decorator class. Likewise, labels should really be handled this way.
Yes, then VectorWritable needs another header bit for "weighted" and
needs to reconstruct the vector appropriately. It starts to get messy,
but works.
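For what it's worth, a sketch of what "another header bit" could look like. All names here (Vec, PlainVec, NamedVec, WeightedVec, the flag values, the field order) are made up for illustration and are not the real VectorWritable format. The writer peels decorators off and records a flag for each; the reader rebuilds the wrappers from the flags.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-ins for illustration; not Mahout's classes or format.
class DecoratorFlagsSketch {
    static final int FLAG_NAMED = 1;          // assumed header bits
    static final int FLAG_WEIGHTED = 1 << 1;

    interface Vec { double[] values(); }

    static final class PlainVec implements Vec {
        private final double[] values;
        PlainVec(double... values) { this.values = values; }
        public double[] values() { return values; }
    }

    // Decorators hold the extra field and delegate everything else.
    static final class NamedVec implements Vec {
        final Vec delegate;
        final String name;
        NamedVec(Vec delegate, String name) { this.delegate = delegate; this.name = name; }
        public double[] values() { return delegate.values(); }
    }

    static final class WeightedVec implements Vec {
        final Vec delegate;
        final double weight;
        WeightedVec(Vec delegate, double weight) { this.delegate = delegate; this.weight = weight; }
        public double[] values() { return delegate.values(); }
    }

    static void write(DataOutput out, Vec v) throws IOException {
        int flags = 0;
        String name = null;
        double weight = 0.0;
        Vec core = v;
        // Peel decorators from the outside in, remembering what each carried.
        // If the same decorator appears twice, the innermost value wins.
        while (!(core instanceof PlainVec)) {
            if (core instanceof NamedVec) {
                NamedVec n = (NamedVec) core;
                flags |= FLAG_NAMED;
                name = n.name;
                core = n.delegate;
            } else if (core instanceof WeightedVec) {
                WeightedVec w = (WeightedVec) core;
                flags |= FLAG_WEIGHTED;
                weight = w.weight;
                core = w.delegate;
            } else {
                throw new IOException("unknown decorator: " + core);
            }
        }
        out.writeByte(flags);
        double[] values = core.values();
        out.writeInt(values.length);
        for (double d : values) out.writeDouble(d);
        if ((flags & FLAG_NAMED) != 0) out.writeUTF(name);
        if ((flags & FLAG_WEIGHTED) != 0) out.writeDouble(weight);
    }

    static Vec read(DataInput in) throws IOException {
        int flags = in.readByte();
        double[] values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) values[i] = in.readDouble();
        Vec v = new PlainVec(values);
        // The reader picks one fixed nesting order: weighted wraps named.
        if ((flags & FLAG_NAMED) != 0) v = new NamedVec(v, in.readUTF());
        if ((flags & FLAG_WEIGHTED) != 0) v = new WeightedVec(v, in.readDouble());
        return v;
    }

    static Vec roundTrip(Vec v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            write(new DataOutputStream(bytes), v);
            return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Vec back = roundTrip(new WeightedVec(new NamedVec(new PlainVec(1.0, 2.0), "doc-1"), 0.5));
        System.out.println(back.getClass().getSimpleName());
    }
}
```

The "messy" part shows up in the peeling loop and in the fixed nesting order on read: the original wrapping order is not preserved, only the set of decorations.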

My original question was: do we need a "weighted vector" entity? Or is
this only used in a context where one needs to serialize "a vector,
and a weight too"? In the latter case, fine, easy: that Writable should
simply compose rather than extend VectorWritable IMHO.
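Roughly what I mean by compose, as a sketch only. VecWritable below is a simplified, hypothetical stand-in for VectorWritable, and WeightedVecWritable is an invented name. The method names mirror Hadoop's Writable (write/readFields), but the sketch does not depend on Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-ins for illustration; not Mahout's classes.
class ComposeSketch {
    // Stand-in for a final VectorWritable: owns the vector's serialized form.
    static final class VecWritable {
        double[] values = new double[0];

        void write(DataOutput out) throws IOException {
            out.writeInt(values.length);
            for (double d : values) out.writeDouble(d);
        }

        void readFields(DataInput in) throws IOException {
            values = new double[in.readInt()];
            for (int i = 0; i < values.length; i++) values[i] = in.readDouble();
        }
    }

    // "A vector, and a weight too": holds a VecWritable instead of extending it.
    static final class WeightedVecWritable {
        double weight;
        final VecWritable vector = new VecWritable();

        void write(DataOutput out) throws IOException {
            out.writeDouble(weight);   // own field first
            vector.write(out);         // then delegate the vector part
        }

        void readFields(DataInput in) throws IOException {
            weight = in.readDouble();
            vector.readFields(in);
        }
    }

    static WeightedVecWritable roundTrip(WeightedVecWritable w) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            w.write(new DataOutputStream(bytes));
            WeightedVecWritable back = new WeightedVecWritable();
            back.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return back;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WeightedVecWritable w = new WeightedVecWritable();
        w.weight = 0.25;
        w.vector.values = new double[] {1.0, 2.0, 3.0};
        WeightedVecWritable back = roundTrip(w);
        System.out.println(back.weight + " / " + back.vector.values.length + " elements");
    }
}
```

Nothing here claims to be readable as a plain vector record, so the vector format itself stays untouched.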


To Drew's question:

No, and that's the issue, really. A file of MultiLabelVectorWritable
cannot be read by VectorWritable since the latter does not expect that
extra data. It's not quite a Hadoop issue, but simply that the OO
world's object representation in memory doesn't exactly translate to
serializing to a stream neatly.
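A small illustration of the mismatch, under an invented byte layout (length, elements, then a label count and labels; not the actual MultiLabelVectorWritable format). A reader that only expects the vector part gets the first record right, then treats the leftover label bytes as the start of the next record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical layout for illustration; not Mahout's actual format.
class MisalignedReadSketch {
    // Writes a one-element vector followed by extra label data.
    static void writeLabeled(DataOutput out, double value, int label) throws IOException {
        out.writeInt(1);         // vector length
        out.writeDouble(value);  // the single element
        out.writeInt(1);         // label count (the extra data)
        out.writeInt(label);     // the label itself
    }

    // Reads only what a plain vector reader expects: length, then elements.
    static double[] readPlain(DataInput in) throws IOException {
        double[] values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) values[i] = in.readDouble();
        return values;
    }

    // Writes two labeled records, then reads two "plain" records back.
    static double[][] demo() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            writeLabeled(out, 1.0, 7);
            writeLabeled(out, 2.0, 7);
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            double[] first = readPlain(in);   // correct: the vector part comes first
            double[] second = readPlain(in);  // misaligned: starts at the label bytes
            return new double[][] {first, second};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        double[][] records = demo();
        System.out.println("first record:  " + records[0][0]);
        System.out.println("second record: " + records[1][0] + " (expected 2.0)");
    }
}
```

In this layout the failure is silent: the label count happens to look like a valid length, so the reader returns a garbage value instead of throwing.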

Yes I would mark VectorWritable final.


To Jeff:

Yeah I guess we're agreed there. In the interest of not rocking the
boat too much I'd be pleased to tease apart these Writables to start,
have a good discussion here, and then make these wrappers if needed
later. It's not yet clear to me we need "WeightedVector", for
instance.



On Mon, Sep 13, 2010 at 1:29 PM, Robin Anil wrote:
> On top of that I question whether we need MultiLabelVector and
> WeightedVector at all, cant multilabel and instance weight be nullable
> fields inside Vector. I dont see a big win (space efficiency) in keeping a
> verbose name and separate subclasses just to add 2 fields. A boolean bit can
> be kept to see if weight or labels are serialized.
>
