On Sat, Apr 17, 2010 at 2:14 PM, Robin Anil <robin.a...@gmail.com> wrote:
>
> For this bug, let's put the id back in and remove it from the
> comparator/equals. Let's focus on getting the document structure correct.
>

You mean put the 'name' back in?

Since Sean has done the initial work of possibly completely removing it,
let's think for a bit on whether we can actually get away with pulling this
off.

In the distributed matrix code, vector "ids" are separate from the vector
instances themselves: the data set is key-value pairs, with the key being
an integer for the row, and the value being a vector.  If the user wants to
associate these ids with longer names, they can, but for the sake of
symmetry, the keys into the rows and columns are ints (maybe at some
point we need to address the need for long values, but that's another
discussion).
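
A minimal sketch of that layout in plain Java, with no Hadoop or Mahout
dependency. The class name RowKeyedVectors and its methods are hypothetical
and only illustrate the idea: rows are keyed by int, the vectors carry no
name, and a longer name lives in an optional side table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration: int row keys, unnamed vectors, optional names kept separately. */
public class RowKeyedVectors {
    // The data set itself: key is the integer row id, value is the vector.
    private final Map<Integer, double[]> rows = new TreeMap<>();
    // Optional side table, populated only if the user wants longer names.
    private final Map<Integer, String> names = new HashMap<>();

    public void putRow(int rowId, double[] vector) {
        rows.put(rowId, vector.clone());
    }

    public void nameRow(int rowId, String name) {
        names.put(rowId, name);
    }

    /** Returns the stored vector, or null if no row has this id. */
    public double[] row(int rowId) {
        return rows.get(rowId);
    }

    /** Falls back to the numeric id when no longer name was registered. */
    public String nameOf(int rowId) {
        return names.getOrDefault(rowId, Integer.toString(rowId));
    }

    public static void main(String[] args) {
        RowKeyedVectors m = new RowKeyedVectors();
        m.putRow(3, new double[] {1.0, 2.0});
        m.nameRow(3, "doc-three");
        System.out.println(m.nameOf(3) + " has " + m.row(3).length + " entries");
    }
}
```

Nothing in the vector itself refers to its id or name, so equality and
comparison of vectors cannot depend on either.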

Ideally, clustering would also act on the same data structure - it's a
list of Vector instances, right?  It's just using the internalized "name"
as an identifier.

Currently, FuzzyKMeansClusterMapper has WritableComparable<?>
keys which are ignored.  Could we instead have the identifier for the
vector live there, where it makes sense?  Then that same key could
be the mapper output key, instead of the name of the Vector.

This kind of change could get the clustering code to effectively be
able to run sensibly on the same SequenceFile<IntWritable,VectorWritable>
that DistributedRowMatrix is running on, and that would be very nice,
I think.
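
A rough sketch of the key pass-through idea, again in plain Java rather than
against the Hadoop Mapper API. Everything here is an assumption made for
illustration: the class KeyPassThroughMapper is hypothetical, and the body
uses a hard nearest-center assignment as a stand-in, not the fuzzy k-means
computation. The only point is that the input key is emitted unchanged as
the output key, so the vector needs no name.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Hypothetical illustration: the record key, not a vector name, identifies the point. */
public class KeyPassThroughMapper {

    /**
     * Returns (input key, index of the nearest center).
     * Stand-in logic: squared Euclidean distance, hard assignment.
     */
    public static Map.Entry<Integer, Integer> map(int key, double[] vector, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int i = 0; i < vector.length; i++) {
                double diff = vector[i] - centers[c][i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        // The key passes straight through as the output key.
        return new SimpleEntry<>(key, best);
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        Map.Entry<Integer, Integer> out = map(42, new double[] {9.0, 9.0}, centers);
        System.out.println("row " + out.getKey() + " -> cluster " + out.getValue());
    }
}
```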

  -jake

So part of the problem which was to be solved by removing the name from
the base AbstractVector was that upon serializing a vector with a null
name, we deserialized to "" instead.  We can fix that by just "being
careful", I guess.
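
One way "being careful" could look, as a sketch only: write a presence flag
before the name so that null and "" stay distinct on the wire. This is not
the actual Mahout serialization format; the class NullableNameIO and its
methods are hypothetical, and only java.io is used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical illustration: a presence flag keeps a null name from turning into "". */
public class NullableNameIO {

    public static void writeName(DataOutput out, String name) throws IOException {
        out.writeBoolean(name != null);   // flag: is a name present at all?
        if (name != null) {
            out.writeUTF(name);           // "" is written as a real, empty string
        }
    }

    public static String readName(DataInput in) throws IOException {
        return in.readBoolean() ? in.readUTF() : null;
    }

    /** Serializes and deserializes in memory, for demonstration. */
    public static String roundTrip(String name) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeName(new DataOutputStream(bytes), name);
            return readName(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("null stays null: " + (roundTrip(null) == null));
        System.out.println("empty stays empty: " + "".equals(roundTrip("")));
    }
}
```

The cost is one extra byte per vector, and any change like this alters the
on-disk format, so existing files would need to be considered.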



> Robin
>
> > Drew
> >
>
