[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-20 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858807#action_12858807 ] Sean Owen commented on MAHOUT-379: -- I'd like to commit this patch as it addresses a couple

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Ted Dunning
On Sun, Apr 18, 2010 at 3:37 PM, Jake Mannix wrote: > > > 4) VectorWritable acts still just as it is now, basically > > > > Yes, made it more general so we don't have to modify it to handle each > > new Vector impl too. > > > > The trick is to make the writing part efficient without knowing the

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
+1 NamedVector seems a lot like VectorView. I'm comfortable enough with this proposal for Sean to go forward with it. I agree with separating the naming/identifying/labeling into a separate wrapper class so that vectors themselves can be pure mathematical entities. Unifying as many as possible

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
On Sun, Apr 18, 2010 at 3:23 PM, Sean Owen wrote: > On Sun, Apr 18, 2010 at 11:16 PM, Jake Mannix > wrote: > > VectorWritable currently is a proper decorator, right? It doesn't even > > implement Vector at all. > > Yeah, the other *Writable classes should be as well. NamedVector > should both b

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
On Sun, Apr 18, 2010 at 11:16 PM, Jake Mannix wrote: > VectorWritable currently is a proper decorator, right?  It doesn't even > implement Vector at all. Yeah, the other *Writable classes should be as well. NamedVector should both be a Vector and decorate a Vector too. Its Writable also decorates

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
On Sun, Apr 18, 2010 at 12:30 PM, Sean Owen wrote: > I guess I'm suggesting the polymorphism pain need not be very painful. > (No doubt it's all nicer with Avro, but that much can be separate.) > > VectorWritable is the one Writable used in all cases. > We have *Writable decorators, corresponding

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Ted Dunning
Yes. I think that you could eliminate most of the current pain of writable polymorphism at the moderate cost of code maintenance down the line as new kinds of vectors get invented. We could even change the deal later. I wouldn't call this decorator so much as "generic writable" in that the writa

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
I guess I'm suggesting the polymorphism pain need not be very painful. (No doubt it's all nicer with Avro, but that much can be separate.) VectorWritable is the one Writable used in all cases. We have *Writable decorators, corresponding to *Vector, in a similar hierarchy. We have NamedVector decor

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Ted Dunning
This is a fundamental problem with Hadoop's type dispatch for writables. Building polymorphic writables is always a pain. On the other hand, with an Avro input format, polymorphism is pretty much assumed and comes for nearly free. My own preference for a naming layer is to allow us to have a pure

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
Good point and something that must be resolved. You're bordering on my next desired change, which maybe I lump into this massive patch. NamedVectorWritable extends NamedVector extends Vector. At first glance, great, I can treat NamedVectorWritable like any old Vector in my code and it's all compat

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
It's not just that it is complicated, it's that say you want to do clustering. You make a SequenceFile of any old key type, and NamedVectorWritable as the value. Now you can't use that file as input for any DistributedRowMatrix operation, you have to do a full pass over the data to peel off the n

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
NamedVectorWritable would go with it. ... and if you're going to bring up that that gets a little complicated, I totally agree, and would love to get on a tangent about making this a decorator pattern rather than subclass. On Sun, Apr 18, 2010 at 7:26 PM, Jake Mannix wrote: > What would be the W

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
What would be the Writable hierarchy with this NamedVector proposal? > > On Apr 18, 2010 11:05 AM, "Sean Owen" wrote: > > On keeping 'name': sure, I ... On Sun, Apr 18, 2010 at 6:45 PM, Jake Mannix wrote: > Ok this is a good con...

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
I'm not convinced actually... but I can't properly express my type safety / flexibility concerns on this smartphone. I'll try to monkey with any available patches tonight, and maybe demonstrate what I mean by need for flexibility. -jake On Apr 18, 2010 11:05 AM, "Sean Owen" wrote: On keeping

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
Sure, maybe just initialize names to "" instead of null? private String name = ""; On 4/18/10 10:45 AM, Jake Mannix wrote: Ok this is a good concrete example, I like concrete. :) I'm still very wary of having to have some mapper or reducer classes deal with LabeledVector some deal with jus

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
On keeping 'name': sure, I don't mind being conservative. I would like to keep name in the form of NamedVector. As it happens, name is actually barely used right now -- if you can wade through the patch you can see there are just a few instances, the ones in mind now. Making NamedVector is, it seems

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
Ok this is a good concrete example, I like concrete. :) I'm still very wary of having to have some mapper or reducer classes deal with LabeledVector and some deal with just Vector... Maybe the right thing to do for now is keep the name in Vector directly, but fix the serialization which currently can

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
Also mean shift clustering relies on vector identities and tbd emitting its clustered points for CDbw would need to retain them. On 4/18/10 10:22 AM, Jeff Eastman wrote: CanopyClusterer.emitPointToExistingCanopies emits clusterId :: VectorWritable On 4/18/10 10:07 AM, Jake Mannix wrote: In

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
CanopyClusterer.emitPointToExistingCanopies emits clusterId :: VectorWritable On 4/18/10 10:07 AM, Jake Mannix wrote: In code we already have? -jake On Apr 18, 2010 9:53 AM, "Jeff Eastman" wrote: I can think of situations where I need to use a clusterId as the key-part and a Vector as t

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jeff Eastman
I can think of situations where I need to use a clusterId as the key-part and a Vector as the value-part. If the Vector is going to have a consistent identity as it moves through jobs then that would need to be inside the Vector. On 4/18/10 8:41 AM, Jake Mannix wrote: Which one is "this"? Wr

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
Well if you went the way of NamedVector for the current issue (the failing of TestFuzzyKMeans), you'd have to change the signature of the clusterer to take NamedVector so you could get access to the name, or else cast it, and that would get reversed as soon as we had the clusters act on the same st

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
I mean wrapping Vector in a NamedVector. It seems like a good step forward, even as I agree that it probably isn't even needed. Since I'm the one ripping up the floor-boards here to do some plumbing, seems like it should fall on me to put things back into a similar working state with NamedVector. T

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Jake Mannix
Which one is "this"? Wrapping Vector impls into a NamedVector/LabeledVector, or seeing if we even need the label *inside* of the Vector itself, and instead just having those live in the "key" part of the key-value pair in hadoop, like DistributedRowMatrix has it? -jake On Sun, Apr 18, 2010 at

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
Yeah why don't I have a crack at this. The change as it stands is already too big for what it is (though I believe they're good changes.) Then we look at more changes, and sounds like there are several ideas for streamlining vectors, which is a great thing to think about at this early stage. On Su

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Ted Dunning
How about this alternative: NamedVector: {Vector: wrapped, String: name} Vector: AbstractVector AbstractVector: DenseVector | SequentialSparseVector | HashSparseVector This avoids the multiplicative explosion of vector types. On Sat, Apr 17, 2010 at 4:17 PM, Robin Anil wrote: > Agreed. Thats
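
A minimal sketch of the wrapper structure proposed above, for readers following the thread. It assumes a hypothetical two-method Vec interface standing in for Mahout's much larger Vector interface; Vec, DenseVec and NamedVec are illustrative names, not Mahout classes. The point it shows is that one wrapper can name any implementation, so no Named* variant is needed per vector type.

```java
// Hypothetical minimal stand-in for Mahout's Vector interface (assumption).
interface Vec {
  int size();
  double get(int index);
}

// One concrete implementation; sparse variants would implement Vec the same way.
final class DenseVec implements Vec {
  private final double[] values;

  DenseVec(double... values) {
    this.values = values.clone();
  }

  public int size() {
    return values.length;
  }

  public double get(int index) {
    return values[index];
  }
}

// Decorator: carries a name plus any Vec, and is itself a Vec,
// so it can be passed wherever a plain Vec is expected.
final class NamedVec implements Vec {
  private final Vec delegate;
  private final String name;

  NamedVec(Vec delegate, String name) {
    this.delegate = delegate;
    this.name = name;
  }

  public int size() {
    return delegate.size();
  }

  public double get(int index) {
    return delegate.get(index);
  }

  String getName() {
    return name;
  }

  Vec getDelegate() {
    return delegate;
  }
}
```

Code that only does math sees a Vec; code that needs the label asks the NamedVec for getName(), and getDelegate() peels the name off without copying the data.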

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Robin Anil
Agreed. That's the correct way to go. But like I said, it warrants a complete overhaul and a separate JIRA issue. The quick fix I indicated (i.e. putting the ID back in but removing it from compare/equals function) was just for this bug. How does this structuring sound? Vector(Interface) -> Abstr

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Ted Dunning
That would be a very, very good thing (uniform data usage). On Sat, Apr 17, 2010 at 2:52 PM, Jake Mannix wrote: > Currently, FuzzyKMeansClusterMapper has WritableComparable > keys which are ignored. Could we instead have the identifier for the > vector live there, where it makes sense? Then th

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Ted Dunning
How about we make a wrapper that can hold the name and use that in each of the clustering cases. Then the wrapper can store the name without polluting the semantics of the vectors. On Sat, Apr 17, 2010 at 2:52 PM, Jake Mannix wrote: > > For this bug, lets put the id back in and remove it from t

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Jake Mannix
On Sat, Apr 17, 2010 at 2:14 PM, Robin Anil wrote: > > For this bug, lets put the id back in and remove it from the > comparator/equals. Lets focus on getting the document structure correct > You mean put the 'name' back in? Since Sean has done the initial work of possibly completely removing it

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Robin Anil
On Sun, Apr 18, 2010 at 12:11 AM, Drew Farris wrote: > On Sat, Apr 17, 2010 at 2:23 PM, Sean Owen wrote: > > > > At the moment I want to understand how to patch up the fuzzy k-means > > code in this regard -- will probably switch to something slightly less > > state-dependent than asFormatString

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Drew Farris
On Sat, Apr 17, 2010 at 2:23 PM, Sean Owen wrote: > > At the moment I want to understand how to patch up the fuzzy k-means > code in this regard -- will probably switch to something slightly less > state-dependent than asFormatString() as a key and be done with it for > the moment. After looking

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Robin Anil
Why not just keep the identifier and not compare it when doing equals? Let it be like a tag of the vector. On Sat, Apr 17, 2010 at 11:53 PM, Sean Owen wrote: > At the moment I'm already overreaching on the way to fix MAHOUT-379 > with this patch, as I've expanded to address some mildly related
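
The "tag" suggestion above could look roughly like the following sketch. TaggedVec is a hypothetical class written for illustration, not the Mahout code or the patch under discussion: the name is stored and retrievable, but equals() and hashCode() look only at the values.

```java
import java.util.Arrays;

// Hypothetical vector that carries a name purely as a tag (assumption).
final class TaggedVec {
  private final String name;     // tag only; never consulted by equals/hashCode
  private final double[] values;

  TaggedVec(String name, double... values) {
    this.name = name;
    this.values = values.clone();
  }

  String getName() {
    return name;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) return true;
    if (!(other instanceof TaggedVec)) return false;
    // Compare values only; two vectors with different names can be equal.
    return Arrays.equals(values, ((TaggedVec) other).values);
  }

  @Override
  public int hashCode() {
    // Must ignore the name too, to stay consistent with equals().
    return Arrays.hashCode(values);
  }
}
```

One consequence of this design is that two differently named vectors with the same values collapse to a single entry when used as keys in a HashMap or HashSet.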

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Sean Owen
At the moment I'm already overreaching on the way to fix MAHOUT-379 with this patch, as I've expanded to address some mildly related issues (equals, iterators). So I personally am not trying to change serialization formats in MAHOUT-379 / my current patch, no. The issue uncovered by removing name

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Drew Farris
it is worth some investigation to determine if there is merit to adapting Mahout's MR jobs to use avro. Doug has recently committed a patch to avro (https://issues.apache.org/jira/browse/AVRO-493) that involves considerably less complexity than what I had originally proposed in https://issues.apach

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Ted Dunning
Yes. Probably a bad idea. But if we need a new Writable that carries labels, it might not be a bad idea. The other idea would be to build a labeling wrapper. On Sat, Apr 17, 2010 at 9:43 AM, Jeff Eastman wrote: > Seems like a major rewrite to replace Writable within our MR jobs. >

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Jeff Eastman
Are you thinking of replacing our Writable or Json (asFormatString) encodings? Certainly, using Avro as an I/O format for clustering would improve their utility for other languages. Seems like a major rewrite to replace Writable within our MR jobs. On 4/17/10 9:10 AM, Ted Dunning wrote: IF th

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Sean Owen
Yeah that's what I changed -- now the key is point.asFormatString(). And it almost works, except the serialized state in this format string includes lengthSquared, and a mismatch there before/after makes this fail. It may fail more significantly in the real world versus tests and we should be caut
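
To illustrate the failure mode described above, here is a small sketch of how a format string that embeds a lazily cached field stops being a stable key. CachingVec, its "-1 means not computed" convention and its string layout are assumptions made for the example; Mahout's actual asFormatString() output is not reproduced here.

```java
import java.util.Arrays;

// Hypothetical vector with a lazily computed, cached squared length (assumption).
final class CachingVec {
  private final double[] values;
  private double lengthSquared = -1.0; // -1 means "not computed yet"

  CachingVec(double... values) {
    this.values = values.clone();
  }

  double getLengthSquared() {
    if (lengthSquared < 0.0) {
      double sum = 0.0;
      for (double v : values) {
        sum += v * v;
      }
      lengthSquared = sum; // the cache is filled on first use
    }
    return lengthSquared;
  }

  // Includes the cache, so the result depends on whether
  // getLengthSquared() has been called yet.
  String asFormatString() {
    return "{values:" + Arrays.toString(values) + ",lengthSquared:" + lengthSquared + "}";
  }

  // Built from the values alone, so it is the same before and after.
  String stableKey() {
    return Arrays.toString(values);
  }
}
```

A key taken from asFormatString() before the length is computed will not match one taken afterwards, which is the before/after mismatch mentioned in the message; a key built only from the values does not have that problem.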

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Ted Dunning
If the format is about to change, should we look at Avro to encode it? Drew seemed to like Avro in his document representation work. On Sat, Apr 17, 2010 at 9:05 AM, Jeff Eastman wrote: > Seems to me we need to rethink this step anyway if we are going to > implement the CDbw cluster evaluation a

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Jeff Eastman
Looking at the KMeansClusterer.outputPointWithClusterInfo it seems this code will have to change in the patch but I haven't yet looked: String name = point.getName(); String key = (name != null) && (name.length() != 0) ? name : point.asFormatString(); output.collect(new Text(key), n

[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858153#action_12858153 ] Robin Anil commented on MAHOUT-379: --- If the id from the vector is removed, I believe it w

[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-16 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857779#action_12857779 ] Sean Owen commented on MAHOUT-379: -- Yup, this is why I said "pre-patch", in reference to a

[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-16 Thread Danny Leshem (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857778#action_12857778 ] Danny Leshem commented on MAHOUT-379: - Sean, your patch neither fixes the original issu

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Jake Mannix
+1 -jake On Apr 14, 2010 3:20 PM, "Jeff Eastman" wrote: Ted Dunning wrote: > > On Wed, Apr 14, 2010 at 12:53 PM, Sean Owen < sro...@gmail.com> wrote: > > >... +1 from the creator thereof, even. Especially since they never got used.

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Jeff Eastman
Ted Dunning wrote: On Wed, Apr 14, 2010 at 12:53 PM, Sean Owen wrote: I would actually prefer ripping names out of the base vectors entirely. They should decorate the mathematical vector, but as their use is decidedly non-mathematical and application specific (we basically do

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Ted Dunning
On Wed, Apr 14, 2010 at 12:53 PM, Sean Owen wrote: > > > I would actually prefer ripping names out of the base vectors entirely. > > They should decorate the mathematical vector, but as their use is > decidedly > > non-mathematical and application specific (we basically don't use them at > > all

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Sean Owen
On Wed, Apr 14, 2010 at 3:28 PM, Jake Mannix wrote: > What is the transitivity problem?  If (a instanceof VClassA), (b instanceof > VClassB) and (c instanceof VClassC), if all three equals() methods compare > the same things (ie values, names, not implementation), then a.equals(b) && > b.equals(c)
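
The quoted argument can be illustrated with a sketch in which every implementation inherits one equals() that compares only size and values. ValueVec, ArrayBackedVec and MapBackedVec are hypothetical names chosen for this example and do not reflect the actual Mahout hierarchy or the patch. Because all classes share the same comparison, a.equals(b) and b.equals(c) imply a.equals(c) regardless of which implementations are involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical base class: equality is defined once, on values only (assumption).
abstract class ValueVec {
  abstract int size();
  abstract double get(int index);

  @Override
  public final boolean equals(Object other) {
    if (this == other) return true;
    if (!(other instanceof ValueVec)) return false;
    ValueVec that = (ValueVec) other;
    if (size() != that.size()) return false;
    for (int i = 0; i < size(); i++) {
      // Primitive comparison: 0.0 equals -0.0; a NaN element never matches
      // anything in a different object, which still leaves equals transitive.
      if (get(i) != that.get(i)) return false;
    }
    return true;
  }

  @Override
  public final int hashCode() {
    int hash = size();
    for (int i = 0; i < size(); i++) {
      double v = get(i);
      // Normalize -0.0 to 0.0 so hashCode agrees with the == test in equals().
      hash = 31 * hash + Double.hashCode(v == 0.0 ? 0.0 : v);
    }
    return hash;
  }
}

// Dense storage.
final class ArrayBackedVec extends ValueVec {
  private final double[] values;

  ArrayBackedVec(double... values) {
    this.values = values.clone();
  }

  int size() {
    return values.length;
  }

  double get(int index) {
    return values[index];
  }
}

// Sparse storage: only non-zero entries are kept.
final class MapBackedVec extends ValueVec {
  private final int size;
  private final Map<Integer, Double> nonZero = new HashMap<>();

  MapBackedVec(int size) {
    this.size = size;
  }

  MapBackedVec set(int index, double value) {
    nonZero.put(index, value);
    return this;
  }

  int size() {
    return size;
  }

  double get(int index) {
    return nonZero.getOrDefault(index, 0.0);
  }
}
```

Declaring equals() and hashCode() final in the base class is the design choice that matters here: a subclass cannot add its own fields (such as a name) to the comparison, which is how symmetry and transitivity usually get broken across mixed implementations.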

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Jake Mannix
Ok, back on list with this then (Thanks Danny for reminding us to deal with this perennial issue we have!) On Wed, Apr 14, 2010 at 2:26 AM, Sean Owen (JIRA) wrote: > > Yeah let's take some time to get this right. At the moment I see four > notions of equivalence in Vector (which is down from fi

[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856835#action_12856835 ] Grant Ingersoll commented on MAHOUT-379: I think we probably should have a discussi

[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856812#action_12856812 ] Sean Owen commented on MAHOUT-379: -- Yeah let's take some time to get this right. At the mo