On Sun, Apr 18, 2010 at 3:37 PM, Jake Mannix wrote:
> > > 4) VectorWritable acts still just as it is now, basically
> >
> > Yes, made it more general so we don't have to modify it to handle each
> > new Vector impl too.
> >
>
> The trick is to make the writing part efficient without knowing the
+1 NamedVector seems a lot like VectorView. I'm comfortable enough with
this proposal for Sean to go forward with it. I agree with
separating the naming/identifying/labeling into a separate wrapper class
so that vectors themselves can be pure mathematical entities. Unifying
as many as possible
On Sun, Apr 18, 2010 at 3:23 PM, Sean Owen wrote:
> On Sun, Apr 18, 2010 at 11:16 PM, Jake Mannix
> wrote:
> > VectorWritable currently is a proper decorator, right? It doesn't even
> > implement Vector at all.
>
> Yeah, the other *Writable classes should be as well. NamedVector
> should both b
On Sun, Apr 18, 2010 at 11:16 PM, Jake Mannix wrote:
> VectorWritable currently is a proper decorator, right? It doesn't even
> implement Vector at all.
Yeah, the other *Writable classes should be as well. NamedVector
should both be a Vector and decorate a Vector too. Its Writable also
decorates
On Sun, Apr 18, 2010 at 12:30 PM, Sean Owen wrote:
> I guess I'm suggesting the polymorphism pain need not be very painful.
> (No doubt it's all nicer with Avro, but that much can be separate.)
>
> VectorWritable is the one Writable used in all cases.
> We have *Writable decorators, corresponding
Yes. I think that you could eliminate most of the current pain of writable
polymorphism at the moderate cost of code maintenance down the line as new
kinds of vectors get invented. We could even change the deal later.
I wouldn't call this decorator so much as "generic writable" in that the
writa
I guess I'm suggesting the polymorphism pain need not be very painful.
(No doubt it's all nicer with Avro, but that much can be separate.)
VectorWritable is the one Writable used in all cases.
We have *Writable decorators, corresponding to *Vector, in a similar hierarchy.
We have NamedVector decor
This is a fundamental problem with Hadoop's type dispatch for writables.
Building polymorphic writables is always a pain.
On the other hand, with an Avro input format, polymorphism is pretty much
assumed and comes for nearly free.
My own preference for a naming layer is to allow us to have a pure
Good point and something that must be resolved. You're bordering on my
next desired change, which maybe I lump into this massive patch.
NamedVectorWritable extends NamedVector extends Vector. At first
glance, great, I can treat NamedVectorWritable like any old Vector in
my code and it's all compat
It's not just that it is complicated, it's that say you want to do
clustering. You make a SequenceFile of any old key type, and
NamedVectorWritable as the value. Now you can't use that file as input for
any DistributedRowMatrix operation, you have to do a full pass over the data
to peel off the n
NamedVectorWritable would go with it.
... and if you're going to bring up that that gets a little
complicated, I totally agree, and would love to get on a tangent about
making this a decorator pattern rather than subclass.
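A minimal sketch of the decorator idea discussed above, assuming a stripped-down interface. SimpleVector, DenseSimpleVector, NamedSimpleVector and VectorMath are hypothetical names invented for illustration; they are not Mahout's actual Vector API, and this is not the patch under discussion.

```java
import java.util.Arrays;

/** Hypothetical minimal vector interface; NOT Mahout's Vector API. */
interface SimpleVector {
    int size();
    double get(int index);
}

/** A plain dense vector with no notion of a name. */
final class DenseSimpleVector implements SimpleVector {
    private final double[] values;

    DenseSimpleVector(double... values) {
        // Defensive copy so later changes to the caller's array do not leak in.
        this.values = Arrays.copyOf(values, values.length);
    }

    @Override public int size() { return values.length; }
    @Override public double get(int index) { return values[index]; }
}

/** Decorator: is a SimpleVector and also wraps one, adding only a name. */
final class NamedSimpleVector implements SimpleVector {
    private final SimpleVector delegate;
    private final String name;

    NamedSimpleVector(SimpleVector delegate, String name) {
        this.delegate = delegate;
        this.name = name;
    }

    String getName() { return name; }
    SimpleVector getDelegate() { return delegate; }

    // All vector behaviour is forwarded to the wrapped instance.
    @Override public int size() { return delegate.size(); }
    @Override public double get(int index) { return delegate.get(index); }
}

/** Code written against the plain interface needs no knowledge of names. */
final class VectorMath {
    private VectorMath() {}

    static double sum(SimpleVector v) {
        double total = 0.0;
        for (int i = 0; i < v.size(); i++) {
            total += v.get(i);
        }
        return total;
    }
}
```

Under these assumptions, VectorMath.sum accepts a named or an unnamed vector alike, and getDelegate() hands back the wrapped vector without copying it.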
On Sun, Apr 18, 2010 at 7:26 PM, Jake Mannix wrote:
> What would be the W
What would be the Writable hierarchy with this NamedVector proposal?
> > On Apr 18, 2010 11:05 AM, "Sean Owen" wrote:
> >
> > On keeping 'name': sure, I ...
On Sun, Apr 18, 2010 at 6:45 PM, Jake Mannix wrote:
> Ok this is a good con...
I'm not convinced actually... but I can't properly express my type safety /
flexibility concerns on this smartphone. I'll try to monkey with any
available patches tonight, and maybe demonstrate what I mean by need for
flexibility.
-jake
On Apr 18, 2010 11:05 AM, "Sean Owen" wrote:
On keeping
Sure, maybe just initialize names to "" instead of null?
private String name = "";
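One reason an empty-string default could help, sketched under the assumption that the name is serialized with java.io's DataOutput.writeUTF, which throws NullPointerException when given null. NameHolder is a hypothetical class made up for this illustration, not code from the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical holder showing a never-null name that is safe to serialize. */
final class NameHolder {
    // Default to "" rather than null, as suggested in the message above.
    private String name = "";

    String getName() { return name; }

    void setName(String name) {
        // Normalize null to "" so the field is never null.
        this.name = (name == null) ? "" : name;
    }

    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(name); // would throw NullPointerException for null
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static NameHolder fromBytes(byte[] bytes) {
        NameHolder holder = new NameHolder();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(bytes))) {
            holder.name = in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return holder;
    }
}
```

With this sketch, an unnamed holder round-trips through toBytes and fromBytes and comes back with an empty name, with no null check needed at the call site.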
On 4/18/10 10:45 AM, Jake Mannix wrote:
Ok this is a good concrete example, I like concrete. :)
I'm still very wary of having to have some mapper or reducer classes deal
with LabeledVector and some deal with jus
On keeping 'name': sure, I don't mind being conservative. I would like
to keep name in the form of NamedVector. As it happens, name is
actually barely used right now -- if you can wade through the patch
you can see there's just a few instances, the ones in mind now.
Making NamedVector is, it seems
Ok this is a good concrete example, I like concrete. :)
I'm still very wary of having to have some mapper or reducer classes deal
with LabeledVector and some deal with just Vector...
Maybe the right thing to do for now is keep the name in Vector directly, but
fix the serialization which currently can
Also mean shift clustering relies on vector identities and tbd emitting
its clustered points for CDbw would need to retain them.
On 4/18/10 10:22 AM, Jeff Eastman wrote:
CanopyClusterer.emitPointToExistingCanopies emits clusterId ::
VectorWritable
On 4/18/10 10:07 AM, Jake Mannix wrote:
In
CanopyClusterer.emitPointToExistingCanopies emits clusterId ::
VectorWritable
On 4/18/10 10:07 AM, Jake Mannix wrote:
In code we already have?
-jake
On Apr 18, 2010 9:53 AM, "Jeff Eastman" wrote:
I can think of situations where I need to use a clusterId as the key-part
and a Vector as t
I can think of situations where I need to use a clusterId as the
key-part and a Vector as the value-part. If the Vector is going to have
a consistent identity as it moves through jobs then that would need to
be inside the Vector.
On 4/18/10 8:41 AM, Jake Mannix wrote:
Which one is "this"? Wr
Well if you went the way of NamedVector for the current issue (the failing
of TestFuzzyKMeans), you'd have to change the signature of the clusterer
to take NamedVector so you could get access to the name, or else cast
it, and that would get reversed as soon as we had the clusters act on the
same st
I mean wrapping Vector in a NamedVector. It seems like a good step
forward, even as I agree that it probably isn't even needed. Since I'm
the one ripping up the floor-boards here to do some plumbing, seems
like it should fall on me to put things back into a similar working
state with NamedVector. T
Which one is "this"? Wrapping Vector impls into a
NamedVector/LabeledVector,
or seeing if we even need the label *inside* of the Vector itself, and
instead
just having those live in the "key" part of the key-value pair in hadoop,
like
DistributedRowMatrix has it?
-jake
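A rough sketch of the second option in the question above, where the label lives in the key part of the key-value pair and the value stays a plain vector. Plain java.util types stand in for Hadoop records here; KeyedRows, columnSums and sampleNamedRows are hypothetical names, and this is not how DistributedRowMatrix is implemented.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: the label is the key, the value is a bare vector. */
final class KeyedRows {
    private KeyedRows() {}

    /** Sums each column; the key type K is never inspected. */
    static <K> double[] columnSums(List<Map.Entry<K, double[]>> rows, int numCols) {
        double[] sums = new double[numCols];
        for (Map.Entry<K, double[]> row : rows) {
            double[] values = row.getValue();
            for (int c = 0; c < numCols; c++) {
                sums[c] += values[c];
            }
        }
        return sums;
    }

    /** Two rows keyed by a string label instead of a row index. */
    static List<Map.Entry<String, double[]>> sampleNamedRows() {
        List<Map.Entry<String, double[]>> rows = new ArrayList<>();
        rows.add(new SimpleEntry<>("doc-a", new double[] {1.0, 2.0}));
        rows.add(new SimpleEntry<>("doc-b", new double[] {3.0, 4.0}));
        return rows;
    }
}
```

Because columnSums is generic in the key type, the same rows can be consumed whether they are keyed by a name, a cluster id, or a row number, and no pass is needed to peel a name off the value.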
On Sun, Apr 18, 2010 at
Yeah why don't I have a crack at this. The change as it stands is
already too big for what it is (though I believe they're good
changes.) Then we look at more changes, and sounds like there are
several ideas for streamlining vectors, which is a great thing to
think about at this early stage.
On Su