Re: Clustering in Mahout 0.9 candidate

Ted Dunning Fri, 24 Jan 2014 14:44:19 -0800

Dang.  This community stuff is awesome.

Kudos to all you guys for jumping on this.


My only nit is whether this should move to the dev list.




On Fri, Jan 24, 2014 at 2:30 PM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Thanks guys, I will look at it this weekend too.
>
>
> On Fri, Jan 24, 2014 at 2:24 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> > I have a setup using hadoop M/R kmeans for testing. If I can help in any
> > way let me know and if you don’t get to it I’ll have a look this weekend.
> >
> > Thanks
> >
> > On Jan 24, 2014, at 1:56 PM, Suneel Marthi <suneel_mar...@yahoo.com>
> > wrote:
> >
> > Pat,
> >
> > Andrew's not filed a JIRA for this, so thanks for filing M-1410 to track
> > this.
> >
> > The fix would be to modify ClusterIterator.iterateSeq() - (for the
> > Sequential mode) to read the vector key along with the vector.
> >
> > For the MR mode, CIMapper.java needs to be modified to read the vector
> > key along with the vector.
> >
> > The aforementioned fixes should take care of both KMeans and Fuzzy KMeans
> > clustering.
> >
> > I can work on a patch later today (should have something out by tonight).
> >
> >
> >
> >
> >
> > On Friday, January 24, 2014 4:47 PM, Pat Ferrel <p...@occamsmachete.com>
> > wrote:
> >
> > Yeah, it’s not really the issue with M-1030 but makes the fix unusable. I
> > apologize for not noticing this sooner, my own fault I guess.
> >
> > Did you file a JIRA against the larger issue? Any ETA on a fix (0.9?).
> > Should I go ahead and write my own cluster categorizer?
> >
> > You and Suneel pointed to the problem area but I’m not sure I know the
> > code well enough to patch it myself. I’m building the 1.0-snapshot so if
> > you have a suggestion I’d be happy to try it out. I’m sort of blocked on
> > some kind of fix for it.
> >
> > Thanks
> >
> >
> > On Jan 24, 2014, at 10:46 AM, Andrew Musselman <andrew.mussel...@gmail.com>
> > wrote:
> >
> > That's correct; I reported that last summer and didn't fix it in M-1030
> > since it didn't seem like that's what the group wanted in that bug.
> >
> > I see you're filing another bug, thanks.
> >
> >
> > On Fri, Jan 24, 2014 at 10:29 AM, Pat Ferrel <p...@occamsmachete.com>
> > wrote:
> >
> > > I can’t believe I haven’t noticed this before and so am hoping I’m
> > > mistaken…
> > >
> > > When you are using kmeans to cluster data where there is no “named”
> > > vector, clusteredPoints do not contain the vector ids so the cluster id,
> > > pdf, “distance-squared”, and vector dimensions are not tied to any known
> > > vector and so are, well, pretty much useless afaict.
> > >
> > > This means you have to loop through all your input vectors, recalculate
> > > any of the above values you need and categorize them yourself, right? Is
> > > this how it’s meant to work?
> > >
> > > I have used clustering before but had named vectors (text docs). Anyone
> > > clustering some intermediate Mahout DRM or vectors with no names will have
> > > this problem.
> > >
> > > Someone please tell me I’ve slipped a gear...
> >
> >
>
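
For readers following the thread, here is a minimal sketch of the idea behind the proposed fix: when each input vector is classified, its key is read along with the vector and written out with the result, so the cluster id and distance-squared stay tied to a known input. This is plain Java and does not use any Mahout classes; it is not the actual patch to ClusterIterator.iterateSeq() or CIMapper.java. The class name KeyedClusterAssignment, the Assignment type, the assign() method, and the "row-N" keys are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: plain Java, no Mahout classes. All names here are hypothetical. */
public class KeyedClusterAssignment {

    /** One classified point: the input key travels with the cluster it was assigned to. */
    public static final class Assignment {
        public final String key;
        public final int clusterId;
        public final double distanceSquared;

        Assignment(String key, int clusterId, double distanceSquared) {
            this.key = key;
            this.clusterId = clusterId;
            this.distanceSquared = distanceSquared;
        }
    }

    /** Squared Euclidean distance between two equal-length vectors. */
    static double distanceSquared(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Assigns every keyed vector to its nearest centroid. The key is read
     * together with the vector and kept in the output, which is the point
     * of the sketch: without it, the results cannot be matched to inputs.
     */
    public static List<Assignment> assign(Map<String, double[]> keyedVectors,
                                          double[][] centroids) {
        List<Assignment> result = new ArrayList<>();
        for (Map.Entry<String, double[]> entry : keyedVectors.entrySet()) {
            int best = -1;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double dist = distanceSquared(entry.getValue(), centroids[c]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            result.add(new Assignment(entry.getKey(), best, bestDist));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical unnamed vectors, identified only by their row keys.
        Map<String, double[]> input = new LinkedHashMap<>();
        input.put("row-0", new double[] {0.0, 0.1});
        input.put("row-1", new double[] {9.0, 11.0});
        double[][] centroids = { {0.0, 0.0}, {10.0, 10.0} };

        for (Assignment a : assign(input, centroids)) {
            System.out.println(a.key + " -> cluster " + a.clusterId
                    + " (d^2 = " + a.distanceSquared + ")");
        }
    }
}
```

Because each Assignment carries the key, a caller can join the clustering output back to the original rows without looping over the inputs and recomputing distances, which is the workaround the original question describes.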
