On Sun, Apr 18, 2010 at 12:30 PM, Sean Owen <sro...@gmail.com> wrote:

> I guess I'm suggesting the polymorphism pain need not be very painful.
> (No doubt it's all nicer with Avro, but that much can be separate.)
>
> VectorWritable is the one Writable used in all cases.
> We have *Writable decorators, corresponding to *Vector, in a similar
> hierarchy.
> We have NamedVector decorating Vector.
>
> I submit that solves all known issues here pretty well?
>

VectorWritable currently is a proper decorator, right?  It doesn't even
implement Vector at all.

What exactly are you suggesting the hierarchy should be?

  1) Vector is an interface, NamedVector extends it
  2) DenseVectorWritable implements Writable, does not implement Vector
  3) ditto for *AccessSparseVector
  4) VectorWritable still acts just as it does now, basically
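
To make the question concrete, here is a rough sketch of points 1 and 2 as I read them. Everything in it is an assumption for illustration: Vector is cut down to two methods, WritableSketch is a local stand-in for Hadoop's Writable (same write/readFields pair), and NamedVector is shown as a wrapping class, which is only one possible reading of "extends". VectorWritable and the sparse cases are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Math-only contract, cut down to two methods for this sketch.
interface Vector {
  int size();
  double get(int index);
}

// Local stand-in mirroring the two methods of Hadoop's Writable (assumption:
// used here only so the sketch compiles without Hadoop on the classpath).
interface WritableSketch {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

final class DenseVector implements Vector {
  private final double[] values;
  DenseVector(double[] values) { this.values = values.clone(); }
  @Override public int size() { return values.length; }
  @Override public double get(int index) { return values[index]; }
}

// (1) A Vector that wraps another Vector and adds a label.
final class NamedVector implements Vector {
  private final Vector delegate;
  private final String name;
  NamedVector(Vector delegate, String name) {
    this.delegate = delegate;
    this.name = name;
  }
  String getName() { return name; }
  @Override public int size() { return delegate.size(); }
  @Override public double get(int index) { return delegate.get(index); }
}

// (2) Serialization only: holds a Vector, is not itself a Vector.
final class DenseVectorWritable implements WritableSketch {
  private Vector vector;
  DenseVectorWritable() {}
  DenseVectorWritable(Vector vector) { this.vector = vector; }
  Vector get() { return vector; }

  @Override public void write(DataOutput out) throws IOException {
    out.writeInt(vector.size());
    for (int i = 0; i < vector.size(); i++) {
      out.writeDouble(vector.get(i));
    }
  }

  @Override public void readFields(DataInput in) throws IOException {
    double[] values = new double[in.readInt()];
    for (int i = 0; i < values.length; i++) {
      values[i] = in.readDouble();
    }
    vector = new DenseVector(values);
  }

  // Hypothetical helper: write a vector to bytes and read it back.
  static Vector roundTrip(Vector v) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      new DenseVectorWritable(v).write(new DataOutputStream(bytes));
      DenseVectorWritable back = new DenseVectorWritable();
      back.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return back.get();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch a NamedVector pushed through DenseVectorWritable comes back as a plain DenseVector, so the name only survives if the writable side knows about names.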

Looking over the code, I noticed that both of the SparseVectors are
inefficient in how they read themselves in: they basically build up the
vectors from scratch, which is silly, especially in the case of
SequentialAccessSparseVector, which should just read in the int[] and
double[] and construct the OrderedIntDoubleMapping straight from them.
Waaaay faster.
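
A minimal sketch of the idea, not the actual Mahout code: SeqSparseSketch is a hypothetical stand-in for a sequential-access sparse vector backed by two parallel arrays, and the wire layout (cardinality, count, then index/value pairs) is assumed purely for illustration. The point is that read() fills the arrays directly and hands them over, with no per-element set() call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical stand-in: parallel arrays with indices sorted ascending.
final class SeqSparseSketch {
  final int cardinality;
  final int[] indices;
  final double[] values;

  SeqSparseSketch(int cardinality, int[] indices, double[] values) {
    if (indices.length != values.length) {
      throw new IllegalArgumentException("indices and values must have equal length");
    }
    this.cardinality = cardinality;
    this.indices = indices;
    this.values = values;
  }

  // Assumed layout: cardinality, number of non-zeros, then (index, value) pairs.
  void write(DataOutput out) throws IOException {
    out.writeInt(cardinality);
    out.writeInt(indices.length);
    for (int i = 0; i < indices.length; i++) {
      out.writeInt(indices[i]);
      out.writeDouble(values[i]);
    }
  }

  // Fill the two arrays straight from the stream. The data was written in
  // index order, so no insertion, search, or resizing is needed.
  static SeqSparseSketch read(DataInput in) throws IOException {
    int cardinality = in.readInt();
    int count = in.readInt();
    int[] indices = new int[count];
    double[] values = new double[count];
    for (int i = 0; i < count; i++) {
      indices[i] = in.readInt();
      values[i] = in.readDouble();
    }
    return new SeqSparseSketch(cardinality, indices, values);
  }

  // Lookup by binary search over the sorted indices; absent entries are 0.
  double get(int index) {
    int pos = Arrays.binarySearch(indices, index);
    return pos >= 0 ? values[pos] : 0.0;
  }

  // Hypothetical helper: serialize to bytes and read back.
  static SeqSparseSketch roundTrip(SeqSparseSketch v) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      v.write(new DataOutputStream(bytes));
      return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Setting elements one at a time into a sorted structure pays for a position lookup (and possibly array growth) on every element, whereas this version is a single linear pass over the input.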

  -jake


> enough that I should try it or is that giving it too much momentum?
>




>
> On Sun, Apr 18, 2010 at 8:13 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > This is a fundamental problem with Hadoop's type dispatch for writables.
> > Building polymorphic writables is always a pain.
> >
> > On the other hand, with an Avro input format, polymorphism is pretty much
> > assumed and comes for nearly free.
> >
> > My own preference for a naming layer is to allow us to have a pure vector
> > layer that is just math, not labels.  That does begin to make things
> > complex, though.
> >
> > This polymorphism pain makes putting the name into the vector and accepting
> > whatever strange semantics result (missing == "" instead of null, for
> > instance) more attractive as a temporary measure.
> >
> > On Sun, Apr 18, 2010 at 11:44 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> >
> >> It's not just that it is complicated, it's that say you want to do
> >> clustering.  You make a SequenceFile of any old key type, and
> >> NamedVectorWritable as the value.  Now you can't use that file as input for
> >> any DistributedRowMatrix operation, you have to do a full pass over the data
> >> to peel off the names and spit out regular VectorWritables...
> >>
> >
>
