Pat,

Andrew hasn't filed a JIRA for this, so thanks for filing M-1410 to track it.

For Sequential mode, the fix would be to modify ClusterIterator.iterateSeq() 
to read the vector key along with the vector.

For the MR mode, CIMapper.java needs to be modified to read the vector key 
along with the vector.

The aforementioned fixes should take care of both KMeans and Fuzzy KMeans 
clustering.
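
A rough sketch of the idea in plain Java, not the actual Mahout code. 
KeyedAssignmentSketch, Assignment and assign() are made-up names for 
illustration, and the Map is an assumed stand-in for the key/vector pairs read 
from the input:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names exist in Mahout.
public class KeyedAssignmentSketch {

    /** One clustered point; the input key is kept next to the result. */
    static final class Assignment {
        final String key;
        final int clusterId;
        final double distanceSquared;

        Assignment(String key, int clusterId, double distanceSquared) {
            this.key = key;
            this.clusterId = clusterId;
            this.distanceSquared = distanceSquared;
        }
    }

    /** Squared Euclidean distance between two vectors of equal length. */
    static double distanceSquared(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Assigns each keyed vector to its nearest centroid. The key is read
     * together with the vector and written to the output, so unnamed
     * vectors can still be tied back to their input rows.
     */
    static List<Assignment> assign(Map<String, double[]> input, double[][] centroids) {
        List<Assignment> out = new ArrayList<>();
        for (Map.Entry<String, double[]> entry : input.entrySet()) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = distanceSquared(entry.getValue(), centroids[c]);
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            out.add(new Assignment(entry.getKey(), best, bestDist));
        }
        return out;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the input order, like reading rows in sequence.
        Map<String, double[]> input = new LinkedHashMap<>();
        input.put("row-0", new double[] {0.0, 1.0});
        input.put("row-1", new double[] {9.0, 9.0});
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};

        for (Assignment a : assign(input, centroids)) {
            System.out.println(a.key + " -> cluster " + a.clusterId
                + ", distance-squared " + a.distanceSquared);
        }
    }
}
```

The only point here is that the key travels with the cluster id and distance 
to the output, so each clustered point can be traced back to its input row.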

I can work on a patch later today (should have something out by tonight).

On Friday, January 24, 2014 4:47 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
 
Yeah, it’s not really the issue with M-1030 but makes the fix unusable. I 
apologize for not noticing this sooner, my own fault I guess.

Did you file a JIRA against the larger issue? Any ETA on a fix (0.9)? Should I 
go ahead and write my own cluster categorizer?

You and Suneel pointed to the problem area but I’m not sure I know the code 
well enough to patch it myself. I’m building the 1.0-snapshot so if you have a 
suggestion I’d be happy to try it out. I’m sort of blocked on some kind of fix 
for it.

Thanks


On Jan 24, 2014, at 10:46 AM, Andrew Musselman <andrew.mussel...@gmail.com> 
wrote:

That's correct; I reported that last summer and didn't fix it in M-1030
since it didn't seem like that's what the group wanted in that bug.

I see you're filing another bug, thanks.


On Fri, Jan 24, 2014 at 10:29 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I can’t believe I haven’t noticed this before and so am hoping I’m
> mistaken…
> 
> When you are using kmeans to cluster data where there is no “named”
> vector, clusteredPoints do not contain the vector ids so the cluster id,
> pdf, “distance-squared”, and vector dimensions are not tied to any known
> vector and so are, well, pretty much useless afaict.
> 
> This means you have to loop through all your input vectors, recalculate
> any of the above values you need and categorize them yourself, right? Is
> this how it’s meant to work?
> 
> I have used clustering before but had named vectors (text docs). Anyone
> clustering some intermediate Mahout DRM or vectors with no names will have
> this problem.
> 
> Someone please tell me I’ve slipped a gear...
