Well that goes down another interesting road. I think we have all enjoyed the idea of keeping "decoration" out of the core Vector implementation. Vector and its subclasses represent only different ways of representing elements and values.
Notions like name (and ideally, labels) are farmed out to a decorator like NamedVector. This is all wonderful in the object-oriented world. The language and object layout in memory are most happy for you to treat a NamedVector as just a Vector; the extra data in memory is irrelevant.

It gets tricky when trying to write Writables for all this, since when reading a sequence of objects from a stream you can't know a priori that there's more data in there than you expect, or what to ignore, or how. You don't have a parallel hierarchy of Writables -- it doesn't work that way. Instead you need one factory (VectorWritable) that has knowledge of the serialized form of all these things. (Well, we did initially just serialize with each Vector the name of its corresponding Writable. This is a tidy solution indeed, but is a lot of overhead. So that went away.)

Back to Robin's point: If there is a need for such a thing as a "weighted vector", then I suggest that instead of injecting a field in Vector, it become another decorator class. Likewise, labels should really be handled this way. Yes, then VectorWritable needs another header bit for "weighted" and needs to reconstruct the vector appropriately. It starts to get messy, but works.

My original question was, do we need a "weighted vector" entity? Or is this only used in a context where one needs to serialize "a vector, and a weight too"? In the latter case, fine, easy: it should simply compose rather than extend VectorWritable IMHO.

To Drew's question: No, and that's the issue, really. A file of MultiLabelVectorWritable cannot be read by VectorWritable since the latter does not expect that extra data. It's not quite a Hadoop issue, but simply that the OO world's object representation in memory doesn't translate neatly to serializing to a stream. Yes, I would mark VectorWritable final.

To Jeff: Yeah I guess we're agreed there.
In the interest of not rocking the boat too much I'd be pleased to tease apart these Writables to start, have a good discussion here, and then make these wrappers if needed later. It's not yet clear to me we need "WeightedVector", for instance.

On Mon, Sep 13, 2010 at 1:29 PM, Robin Anil <[email protected]> wrote:
> On top of that I question whether we need MultiLabelVector and
> WeightedVector at all, cant multilabel and instance weight be nullable
> fields inside Vector. I dont see a big win (space efficiency) in keeping a
> verbose name and separate subclasses just to add 2 fields. A boolean bit can
> be kept to see if weight or labels are serialized.
>
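For what it's worth, here is a minimal sketch of the "compose rather than extend" option for the "a vector, and a weight too" case. It is not Mahout code and makes several assumptions: SimpleWritable is a stand-in for Hadoop's Writable contract so the example runs without Hadoop on the classpath, DenseVectorWritable is a hypothetical, much simplified stand-in for VectorWritable (dense doubles only, marked final), and WeightedVectorWritable is a hypothetical name for the composed record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class ComposeSketch {

  /** Stand-in for Hadoop's Writable (same two methods), an assumption to keep this self-contained. */
  public interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  /** Hypothetical, simplified stand-in for VectorWritable. Final, so nobody extends its format. */
  public static final class DenseVectorWritable implements SimpleWritable {
    public double[] values = new double[0];

    public DenseVectorWritable() {}

    public DenseVectorWritable(double[] values) {
      this.values = values;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeInt(values.length);
      for (double v : values) {
        out.writeDouble(v);
      }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      values = new double[in.readInt()];
      for (int i = 0; i < values.length; i++) {
        values[i] = in.readDouble();
      }
    }
  }

  /** Hypothetical composed record: owns a vector writable and a weight, extends neither. */
  public static final class WeightedVectorWritable implements SimpleWritable {
    public double weight;
    public final DenseVectorWritable vector;

    public WeightedVectorWritable() {
      this(0.0, new DenseVectorWritable());
    }

    public WeightedVectorWritable(double weight, DenseVectorWritable vector) {
      this.weight = weight;
      this.vector = vector;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeDouble(weight); // the extra field lives in this record's own format
      vector.write(out);       // the vector's format is delegated, untouched
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      weight = in.readDouble();
      vector.readFields(in);
    }
  }

  /** Serializes a record to bytes and reads it back into a fresh instance. */
  public static WeightedVectorWritable roundTrip(WeightedVectorWritable original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      original.write(new DataOutputStream(bytes));
      WeightedVectorWritable copy = new WeightedVectorWritable();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    WeightedVectorWritable copy = roundTrip(
        new WeightedVectorWritable(0.5, new DenseVectorWritable(new double[] {1.0, 2.0, 3.0})));
    System.out.println(copy.weight + " " + Arrays.toString(copy.vector.values));
  }
}
```

The point of the sketch is only that the weight never leaks into the vector's serialized form: a file of these records is a file of a different type, and nothing that reads plain vectors is ever asked to skip data it does not know about.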
