[
https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858807#action_12858807
]
Sean Owen commented on MAHOUT-379:
--
I'd like to commit this patch as it addresses a couple
On Sun, Apr 18, 2010 at 3:37 PM, Jake Mannix wrote:
> > > 4) VectorWritable acts still just as it is now, basically
> >
> > Yes, made it more general so we don't have to modify it to handle each
> > new Vector impl too.
> >
>
> The trick is to make the writing part efficient without knowing the
+1 NamedVector seems a lot like VectorView. I'm comfortable enough with
this proposal for Sean to go forward with it. I agree with
separating the naming/identifying/labeling into a separate wrapper class
so that vectors themselves can be pure mathematical entities. Unifying
as many as possible
On Sun, Apr 18, 2010 at 3:23 PM, Sean Owen wrote:
> On Sun, Apr 18, 2010 at 11:16 PM, Jake Mannix
> wrote:
> > VectorWritable currently is a proper decorator, right? It doesn't even
> > implement Vector at all.
>
> Yeah, the other *Writable classes should be as well. NamedVector
> should both b
On Sun, Apr 18, 2010 at 11:16 PM, Jake Mannix wrote:
> VectorWritable currently is a proper decorator, right? It doesn't even
> implement Vector at all.
Yeah, the other *Writable classes should be as well. NamedVector
should both be a Vector and decorate a Vector too. Its Writable also
decorates
On Sun, Apr 18, 2010 at 12:30 PM, Sean Owen wrote:
> I guess I'm suggesting the polymorphism pain need not be very painful.
> (No doubt it's all nicer with Avro, but that much can be separate.)
>
> VectorWritable is the one Writable used in all cases.
> We have *Writable decorators, corresponding
Yes. I think that you could eliminate most of the current pain of writable
polymorphism at the moderate cost of code maintenance down the line as new
kinds of vectors get invented. We could even change the deal later.
I wouldn't call this decorator so much as "generic writable" in that the
writa
I guess I'm suggesting the polymorphism pain need not be very painful.
(No doubt it's all nicer with Avro, but that much can be separate.)
VectorWritable is the one Writable used in all cases.
We have *Writable decorators, corresponding to *Vector, in a similar hierarchy.
We have NamedVector decor
This is a fundamental problem with Hadoop's type dispatch for writables.
Building polymorphic writables is always a pain.
On the other hand, with an Avro input format, polymorphism is pretty much
assumed and comes for nearly free.
My own preference for a naming layer is to allow us to have a pure
Good point and something that must be resolved. You're bordering on my
next desired change, which maybe I lump into this massive patch.
NamedVectorWritable extends NamedVector extends Vector. At first
glance, great, I can treat NamedVectorWritable like any old Vector in
my code and it's all compat
It's not just that it is complicated, it's that say you want to do
clustering. You make a SequenceFile of any old key type, and
NamedVectorWritable as the value. Now you can't use that file as input for
any DistributedRowMatrix operation, you have to do a full pass over the data
to peel off the n
NamedVectorWritable would go with it.
... and if you're going to bring up that that gets a little
complicated, I totally agree, and would love to get on a tangent about
making this a decorator pattern rather than subclass.
On Sun, Apr 18, 2010 at 7:26 PM, Jake Mannix wrote:
> What would be the W
What would be the Writable hierarchy with this NamedVector proposal?
> > On Apr 18, 2010 11:05 AM, "Sean Owen" wrote: > > On
keeping 'name': sure, I ...
On Sun, Apr 18, 2010 at 6:45 PM, Jake Mannix wrote:
> Ok this is a good con...
I'm not convinced actually... but I can't properly express my type safety /
flexibility concerns on this smartphone. I'll try to monkey with any
available patches tonight, and maybe demonstrate what I mean by need for
flexibility.
-jake
On Apr 18, 2010 11:05 AM, "Sean Owen" wrote:
On keeping
Sure, maybe just initialize names to "" instead of null?
private String name = "";
On 4/18/10 10:45 AM, Jake Mannix wrote:
Ok this is a good concrete example, I like concrete. :)
I'm still very wary of having to have some mapper or reducer classes deal
with LabeledVector and some deal with jus
On keeping 'name': sure, I don't mind being conservative. I would like
to keep name in the form of NamedVector. As it happens, name is
actually barely used right now -- if you can wade through the patch
you can see there's just a few instances, the ones in mind now.
Making NamedVector is, it seems
Ok this is a good concrete example, I like concrete. :)
I'm still very wary of having to have some mapper or reducer classes deal
with LabeledVector and some deal with just Vector...
Maybe the right thing to do for now is keep the name in Vector directly, but
fix the serialization which currently can
Also mean shift clustering relies on vector identities and tbd emitting
its clustered points for CDbw would need to retain them.
On 4/18/10 10:22 AM, Jeff Eastman wrote:
CanopyClusterer.emitPointToExistingCanopies emits clusterId ::
VectorWritable
On 4/18/10 10:07 AM, Jake Mannix wrote:
In
CanopyClusterer.emitPointToExistingCanopies emits clusterId ::
VectorWritable
On 4/18/10 10:07 AM, Jake Mannix wrote:
In code we already have?
-jake
On Apr 18, 2010 9:53 AM, "Jeff Eastman" wrote:
I can think of situations where I need to use a clusterId as the key-part
and a Vector as t
I can think of situations where I need to use a clusterId as the
key-part and a Vector as the value-part. If the Vector is going to have
a consistent identity as it moves through jobs then that would need to
be inside the Vector.
On 4/18/10 8:41 AM, Jake Mannix wrote:
Which one is "this"? Wr
Well if you went the way of NamedVector for the current issue (the failing
of TestFuzzyKMeans), you'd have to change the signature of the clusterer
to take NamedVector so you could get access to the name, or else cast
it, and that would get reversed as soon as we had the clusters act on the
same st
I mean wrapping Vector in a NamedVector. It seems like a good step
forward, even as I agree that it probably isn't even needed. Since I'm
the one ripping up the floor-boards here to do some plumbing, seems
like it should fall on me to put things back into a similar working
state with NamedVector. T
Which one is "this"? Wrapping Vector impls into a
NamedVector/LabeledVector,
or seeing if we even need the label *inside* of the Vector itself, and
instead
just having those live in the "key" part of the key-value pair in hadoop,
like
DistributedRowMatrix has it?
-jake
On Sun, Apr 18, 2010 at
Yeah why don't I have a crack at this. The change as it stands is
already too big for what it is (though I believe they're good
changes.) Then we look at more changes, and sounds like there are
several ideas for streamlining vectors, which is a great thing to
think about at this early stage.
On Su
How about this alternative:
NamedVector: {Vector: wrapped, String: name}
Vector: AbstractVector
AbstractVector: DenseVector | SequentialSparseVector | HashSparseVector
This avoids the multiplicative explosion of vector types.
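A minimal Java sketch of the structure proposed above, for illustration only. The interfaces and classes here are simplified, hypothetical stand-ins that borrow the names from the message; they are not Mahout's actual API, and the methods shown (size, get, dot, getDelegate, getName) are assumptions. The point is that NamedVector wraps any Vector, so a single wrapper class covers every implementation instead of one named variant per implementation.

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for the Vector interface (not Mahout's).
interface Vector {
    int size();
    double get(int index);
}

// Shared math lives here, independent of how elements are stored.
abstract class AbstractVector implements Vector {
    public double dot(Vector other) {
        if (other.size() != size()) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < size(); i++) {
            sum += get(i) * other.get(i);
        }
        return sum;
    }
}

// One concrete storage layout; sparse variants would sit beside it.
final class DenseVector extends AbstractVector {
    private final double[] values;

    DenseVector(double... values) {
        this.values = values.clone();
    }

    @Override
    public int size() {
        return values.length;
    }

    @Override
    public double get(int index) {
        return values[index];
    }
}

// The wrapper: it is a Vector and it decorates a Vector, adding only a name.
final class NamedVector implements Vector {
    private final Vector wrapped;
    private final String name;

    NamedVector(Vector wrapped, String name) {
        this.wrapped = Objects.requireNonNull(wrapped, "wrapped");
        this.name = Objects.requireNonNull(name, "name");
    }

    Vector getDelegate() {
        return wrapped;
    }

    String getName() {
        return name;
    }

    @Override
    public int size() {
        return wrapped.size();
    }

    @Override
    public double get(int index) {
        return wrapped.get(index);
    }
}

class NamedVectorDemo {
    public static void main(String[] args) {
        NamedVector doc = new NamedVector(new DenseVector(1.0, 2.0, 3.0), "doc-42");
        System.out.println(doc.getName() + " has " + doc.size() + " elements");
    }
}
```

Code that only needs the math can accept a plain Vector and never see the name; code that needs the identity asks for a NamedVector explicitly.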
On Sat, Apr 17, 2010 at 4:17 PM, Robin Anil wrote:
> Agreed. That's
Agreed. That's the correct way to go. But like I said, it warrants a complete
overhaul and a separate JIRA issue. The quick fix I indicated (i.e. putting
the ID back in but removing it from compare/equals function) was just for
this bug.
How does this structuring sound?
Vector(Interface) -> Abstr
That would be a very, very good thing (uniform data usage).
On Sat, Apr 17, 2010 at 2:52 PM, Jake Mannix wrote:
> Currently, FuzzyKMeansClusterMapper has WritableComparable
> keys which are ignored. Could we instead have the identifier for the
> vector live there, where it makes sense? Then th
How about we make a wrapper that can hold the name and use that in each of
the clustering cases. Then the wrapper can store the name without polluting
the semantics of the vectors.
On Sat, Apr 17, 2010 at 2:52 PM, Jake Mannix wrote:
> > For this bug, let's put the id back in and remove it from t
On Sat, Apr 17, 2010 at 2:14 PM, Robin Anil wrote:
>
> For this bug, let's put the id back in and remove it from the
> comparator/equals. Let's focus on getting the document structure correct
>
You mean put the 'name' back in?
Since Sean has done the initial work of possibly completely removing it
On Sun, Apr 18, 2010 at 12:11 AM, Drew Farris wrote:
> On Sat, Apr 17, 2010 at 2:23 PM, Sean Owen wrote:
> >
> > At the moment I want to understand how to patch up the fuzzy k-means
> > code in this regard -- will probably switch to something slightly less
> > state-dependent than asFormatString
On Sat, Apr 17, 2010 at 2:23 PM, Sean Owen wrote:
>
> At the moment I want to understand how to patch up the fuzzy k-means
> code in this regard -- will probably switch to something slightly less
> state-dependent than asFormatString() as a key and be done with it for
> the moment.
After looking
Why not just keep the identifier and not compare it when doing equals? Let
it be like a tag of the vector.
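A rough Java sketch of what "tag, but don't compare" could look like. TaggedVector is a hypothetical class invented for this illustration, not anything in Mahout: the name travels with the values, but equals() and hashCode() look only at the values.

```java
import java.util.Arrays;

// Hypothetical illustration class: the name is carried along as a tag
// but deliberately left out of equality and hashing.
final class TaggedVector {
    private final double[] values;
    private final String name;

    TaggedVector(String name, double... values) {
        this.name = name == null ? "" : name;
        this.values = values.clone();
    }

    String getName() {
        return name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof TaggedVector)) {
            return false;
        }
        // Only the numeric contents are compared; the tag is ignored.
        return Arrays.equals(values, ((TaggedVector) o).values);
    }

    @Override
    public int hashCode() {
        // Must agree with equals(), so the tag is ignored here as well.
        return Arrays.hashCode(values);
    }
}
```

One consequence to keep in mind: two vectors with the same values and different names collapse into a single entry in a HashSet or as HashMap keys, which may or may not be what a clustering job wants.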
On Sat, Apr 17, 2010 at 11:53 PM, Sean Owen wrote:
> At the moment I'm already overreaching on the way to fix MAHOUT-379
> with this patch, as I've expanded to address some mildly related
At the moment I'm already overreaching on the way to fix MAHOUT-379
with this patch, as I've expanded to address some mildly related
issues (equals, iterators).
So I personally am not trying to change serialization formats in
MAHOUT-379 / my current patch, no. The issue uncovered by removing
name
it is worth some investigation to determine if there is merit to
adapting Mahout's MR jobs to use avro. Doug has recently committed a
patch to avro (https://issues.apache.org/jira/browse/AVRO-493) that
involves considerably less complexity than what I had originally
proposed in https://issues.apach
Yes. Probably a bad idea.
But if we need a new Writable that carries labels, it might not be a bad
idea. The other idea would be to build a labeling wrapper.
On Sat, Apr 17, 2010 at 9:43 AM, Jeff Eastman wrote:
> Seems like a major rewrite to replace Writable within our MR jobs.
>
Are you thinking of replacing our Writable or Json (asFormatString)
encodings? Certainly, using Avro as an I/O format for clustering would
improve their utility for other languages. Seems like a major rewrite to
replace Writable within our MR jobs.
On 4/17/10 9:10 AM, Ted Dunning wrote:
If th
Yeah, that's what I changed -- now the key is point.asFormatString().
And it almost works, except the serialized state in this format string
includes lengthSquared, and a mismatch there before/after makes this
fail.
It may fail more significantly in the real world versus tests and we
should be caut
If the format is about to change, should we look at Avro to encode it? Drew
seemed to like Avro in his document representation work.
On Sat, Apr 17, 2010 at 9:05 AM, Jeff Eastman wrote:
> Seems to me we need to rethink this step anyway if we are going to
> implement the CDbw cluster evaluation a
Looking at the KMeansClusterer.outputPointWithClusterInfo it seems this
code will have to change in the patch but I haven't yet looked:
String name = point.getName();
String key = (name != null) && (name.length() != 0) ? name :
point.asFormatString();
output.collect(new Text(key), n
[
https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858153#action_12858153
]
Robin Anil commented on MAHOUT-379:
---
If the id from the vector is removed, I believe it w
[
https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857779#action_12857779
]
Sean Owen commented on MAHOUT-379:
--
Yup, this is why I said "pre-patch", in reference to a
[
https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857778#action_12857778
]
Danny Leshem commented on MAHOUT-379:
-
Sean, your patch neither fixes the original issu
+1
-jake
On Apr 14, 2010 3:20 PM, "Jeff Eastman" wrote:
Ted Dunning wrote: > > On Wed, Apr 14, 2010 at 12:53 PM, Sean Owen <
sro...@gmail.com> wrote: > > >...
+1 from the creator thereof, even. Especially since they never got used.
Ted Dunning wrote:
On Wed, Apr 14, 2010 at 12:53 PM, Sean Owen wrote:
I would actually prefer ripping names out of the base vectors entirely.
They should decorate the mathematical vector, but as their use is
decidedly
non-mathematical and application specific (we basically do
On Wed, Apr 14, 2010 at 12:53 PM, Sean Owen wrote:
>
> > I would actually prefer ripping names out of the base vectors entirely.
> > They should decorate the mathematical vector, but as their use is
> decidedly
> > non-mathematical and application specific (we basically don't use them at
> > all
On Wed, Apr 14, 2010 at 3:28 PM, Jake Mannix wrote:
> What is the transitivity problem? If (a instanceof VClassA), (b instanceof
> VClassB) and (c instanceof VClassC), if all three equals() methods compare
> the same things (ie values, names, not implementation), then a.equals(b) &&
> b.equals(c)
Ok, back on list with this then (Thanks Danny for reminding us to deal with
this perennial issue we have!)
On Wed, Apr 14, 2010 at 2:26 AM, Sean Owen (JIRA) wrote:
>
> Yeah let's take some time to get this right. At the moment I see four
> notions of equivalence in Vector (which is down from fi
[
https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856835#action_12856835
]
Grant Ingersoll commented on MAHOUT-379:
I think we probably should have a discussi
[
https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856812#action_12856812
]
Sean Owen commented on MAHOUT-379:
--
Yeah let's take some time to get this right. At the mo