Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Ted Dunning
On Sun, Apr 18, 2010 at 3:37 PM, Jake Mannix wrote: > > > 4) VectorWritable acts still just as it is now, basically > > > > Yes, made it more general so we don't have to modify it to handle each > > new Vector impl too. > > > > The trick is to make the writing part efficient without knowing the

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
+1 NamedVector seems a lot like VectorView. I'm comfortable enough with this proposal for Sean to go forward with it. I agree with separating the naming/identifying/labeling into a separate wrapper class so that vectors themselves can be pure mathematical entities. Unifying as many as possible

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
On Sun, Apr 18, 2010 at 3:23 PM, Sean Owen wrote: > On Sun, Apr 18, 2010 at 11:16 PM, Jake Mannix > wrote: > > VectorWritable currently is a proper decorator, right? It doesn't even > > implement Vector at all. > > Yeah, the other *Writable classes should be as well. NamedVector > should both b

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
On Sun, Apr 18, 2010 at 11:16 PM, Jake Mannix wrote: > VectorWritable currently is a proper decorator, right?  It doesn't even > implement Vector at all. Yeah, the other *Writable classes should be as well. NamedVector should both be a Vector and decorate a Vector too. Its Writable also decorates
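
A minimal Java sketch of the "is a Vector and decorates a Vector" idea described in this message. The `Vector` interface below is a hypothetical two-method stand-in, not Mahout's real interface, and the class and method names (`DenseVector`, `getDelegate`, `sum`) are assumptions made only for illustration.

```java
public class NamedVectorSketch {

    /** Hypothetical, minimal stand-in for a vector interface (not Mahout's API). */
    interface Vector {
        int size();
        double get(int index);
    }

    /** Plain mathematical vector with no name attached. */
    static final class DenseVector implements Vector {
        private final double[] values;

        DenseVector(double... values) {
            this.values = values.clone();
        }

        public int size() {
            return values.length;
        }

        public double get(int index) {
            return values[index];
        }
    }

    /** Decorator: it is itself a Vector and forwards all math to the wrapped Vector. */
    static final class NamedVector implements Vector {
        private final Vector delegate;
        private final String name;

        NamedVector(Vector delegate, String name) {
            this.delegate = delegate;
            this.name = name;
        }

        public int size() {
            return delegate.size();
        }

        public double get(int index) {
            return delegate.get(index);
        }

        String getName() {
            return name;
        }

        Vector getDelegate() {
            return delegate;
        }
    }

    /** Code written against Vector works unchanged on a NamedVector. */
    static double sum(Vector v) {
        double total = 0.0;
        for (int i = 0; i < v.size(); i++) {
            total += v.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        NamedVector named = new NamedVector(new DenseVector(1.0, 2.0, 3.0), "doc-42");
        System.out.println(sum(named));      // prints 6.0
        System.out.println(named.getName()); // prints doc-42
    }
}
```

In this sketch only the code that needs the name has to know about `NamedVector`; everything else keeps accepting `Vector`.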

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
On Sun, Apr 18, 2010 at 12:30 PM, Sean Owen wrote: > I guess I'm suggesting the polymorphism pain need not be very painful. > (No doubt it's all nicer with Avro, but that much can be separate.) > > VectorWritable is the one Writable used in all cases. > We have *Writable decorators, corresponding

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Ted Dunning
Yes. I think that you could eliminate most of the current pain of writable polymorphism at the moderate cost of code maintenance down the line as new kinds of vectors get invented. We could even change the deal later. I wouldn't call this decorator so much as "generic writable" in that the writa
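
One way to read the "generic writable" idea, sketched in Java under stated assumptions: a single writer emits a type tag followed by the payload, and a single reader dispatches on that tag. The tag values, the byte layout, and every name below are hypothetical; this is not Mahout's or Hadoop's actual serialization format, and it uses only `java.io` rather than Hadoop's `Writable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class GenericVectorWriterSketch {

    // Hypothetical type tags; a new vector kind needs a new tag and a new case in read().
    static final byte DENSE = 0;
    static final byte SPARSE = 1;

    /** Writes: tag, length, then every value. */
    static byte[] writeDense(double[] values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(DENSE);
            out.writeInt(values.length);
            for (double v : values) {
                out.writeDouble(v);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes: tag, cardinality, number of non-zeros, then (index, value) pairs. */
    static byte[] writeSparse(int cardinality, int[] indices, double[] values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(SPARSE);
            out.writeInt(cardinality);
            out.writeInt(indices.length);
            for (int i = 0; i < indices.length; i++) {
                out.writeInt(indices[i]);
                out.writeDouble(values[i]);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads either encoding back into a plain array; the caller never names the concrete type. */
    static double[] read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte tag = in.readByte();
            switch (tag) {
                case DENSE: {
                    int length = in.readInt();
                    double[] result = new double[length];
                    for (int i = 0; i < length; i++) {
                        result[i] = in.readDouble();
                    }
                    return result;
                }
                case SPARSE: {
                    int cardinality = in.readInt();
                    int nonZeros = in.readInt();
                    double[] result = new double[cardinality];
                    for (int i = 0; i < nonZeros; i++) {
                        int index = in.readInt();
                        result[index] = in.readDouble();
                    }
                    return result;
                }
                default:
                    throw new IOException("Unknown vector type tag: " + tag);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The `switch` in `read` is where the maintenance cost mentioned in the message shows up: each newly invented vector kind has to be added there.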

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
I guess I'm suggesting the polymorphism pain need not be very painful. (No doubt it's all nicer with Avro, but that much can be separate.) VectorWritable is the one Writable used in all cases. We have *Writable decorators, corresponding to *Vector, in a similar hierarchy. We have NamedVector decor

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Ted Dunning
This is a fundamental problem with Hadoop's type dispatch for writables. Building polymorphic writables is always a pain. On the other hand, with an Avro input format, polymorphism is pretty much assumed and comes for nearly free. My own preference for a naming layer is to allow us to have a pure

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
Good point and something that must be resolved. You're bordering on my next desired change, which maybe I lump into this massive patch. NamedVectorWritable extends NamedVector extends Vector. At first glance, great, I can treat NamedVectorWritable like any old Vector in my code and it's all compat

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
It's not just that it is complicated, it's that say you want to do clustering. You make a SequenceFile of any old key type, and NamedVectorWritable as the value. Now you can't use that file as input for any DistributedRowMatrix operation, you have to do a full pass over the data to peel off the n

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
NamedVectorWritable would go with it. ... and if you're going to bring up that that gets a little complicated, I totally agree, and would love to get on a tangent about making this a decorator pattern rather than subclass. On Sun, Apr 18, 2010 at 7:26 PM, Jake Mannix wrote: > What would be the W

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
What would be the Writable hierarchy with this NamedVector proposal? > > On Apr 18, 2010 11:05 AM, "Sean Owen" wrote: > > On keeping 'name': sure, I ... On Sun, Apr 18, 2010 at 6:45 PM, Jake Mannix wrote: > Ok this is a good con...

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
I'm not convinced actually... but I can't properly express my type safety / flexibility concerns on this smartphone. I'll try to monkey with any available patches tonight, and maybe demonstrate what I mean by need for flexibility. -jake On Apr 18, 2010 11:05 AM, "Sean Owen" wrote: On keeping

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
Sure, maybe just initialize names to "" instead of null? private String name = ""; On 4/18/10 10:45 AM, Jake Mannix wrote: Ok this is a good concrete example, I like concrete. :) I'm still very wary of having to have some mapper or reducer classes deal with LabeledVector some deal with jus
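
A small Java sketch of why the empty-string default suggested here can help: `equals`, `hashCode`, and any serialization of the name no longer need a null check. The `Labeled` class and its `setName` null-normalization are assumptions for illustration, not code from the patch under discussion.

```java
public class NameDefaultSketch {

    /** Hypothetical holder for a name; stands in for whatever class carries the field. */
    static final class Labeled {
        // Default to "" so the field is never null.
        private String name = "";

        String getName() {
            return name;
        }

        void setName(String name) {
            // Assumption: treat null as "no name" to preserve the never-null invariant.
            this.name = (name == null) ? "" : name;
        }

        boolean hasName() {
            return !name.isEmpty();
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Labeled)) {
                return false;
            }
            // Safe without a null check because name is never null.
            return name.equals(((Labeled) o).name);
        }

        @Override
        public int hashCode() {
            return name.hashCode();
        }
    }

    public static void main(String[] args) {
        Labeled unnamed = new Labeled();
        Labeled alsoUnnamed = new Labeled();
        alsoUnnamed.setName(null);
        System.out.println(unnamed.equals(alsoUnnamed)); // prints true
        System.out.println(unnamed.hasName());           // prints false
    }
}
```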

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
On keeping 'name': sure, I don't mind being conservative. I would like to keep name in the form of NamedVector. As it happens, name is actually barely used right now -- if you can wade through the patch you can see there's just a few instances, the ones in mind now. Making NamedVector is, it seems

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
Ok this is a good concrete example, I like concrete. :) I'm still very wary of having to have some mapper or reducer classes deal with LabeledVector some deal with just Vector... Maybe the right thing to do for now is keep the name in Vector directly, but fix the serialization which currently can

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
Also mean shift clustering relies on vector identities and tbd emitting its clustered points for CDbw would need to retain them. On 4/18/10 10:22 AM, Jeff Eastman wrote: CanopyClusterer.emitPointToExistingCanopies emits clusterId :: VectorWritable On 4/18/10 10:07 AM, Jake Mannix wrote: In

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
CanopyClusterer.emitPointToExistingCanopies emits clusterId :: VectorWritable On 4/18/10 10:07 AM, Jake Mannix wrote: In code we already have? -jake On Apr 18, 2010 9:53 AM, "Jeff Eastman" wrote: I can think of situations where I need to use a clusterId as the key-part and a Vector as t

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
I can think of situations where I need to use a clusterId as the key-part and a Vector as the value-part. If the Vector is going to have a consistent identity as it moves through jobs then that would need to be inside the Vector. On 4/18/10 8:41 AM, Jake Mannix wrote: Which one is "this"? Wr

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
Well if you went the way of NamedVector for the current issue (the failing of TestFuzzyKMeans), you'd have to change the signature of the clusterer to take NamedVector so you could get access to the name, or else cast it, and that would get reversed as soon as we had the clusters act on the same st

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
I mean wrapping Vector in a NamedVector. It seems like a good step forward, even as I agree that it probably isn't even needed. Since I'm the one ripping up the floor-boards here to do some plumbing, seems like it should fall on me to put things back into a similar working state with NamedVector. T

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
Which one is "this"? Wrapping Vector impls into a NamedVector/LabeledVector, or seeing if we even need the label *inside* of the Vector itself, and instead just having those live in the "key" part of the key-value pair in hadoop, like DistributedRowMatrix has it? -jake On Sun, Apr 18, 2010 at

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
Yeah why don't I have a crack at this. The change as it stands is already too big for what it is (though I believe they're good changes.) Then we look at more changes, and sounds like there are several ideas for streamlining vectors, which is a great thing to think about at this early stage. On Su