syntheticcontroldata clustering example failure due to combiner

Adil Aijaz Wed, 10 Jun 2009 11:01:56 -0700

Hi folks,

I am new to Mahout and I started exploring the Mahout 0.1 release by trying to run the kmeans clustering example as described in http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html

After a bunch of runs where, no matter what parameters I specified, the output never changed, I realized that:

1. KMeans was clustering all 600 points of syntheticcontroldata into one cluster.

2. There is a bug in examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java: the main function calls runJob with my provided arguments transposed. So my convergenceDelta was interpreted as t1, t1 as t2, and t2 as convergenceDelta. I will submit a patch as soon as I get approval for open-source commits from my employer; however, I thought I'd put it out there in case someone else is going through the same issue.
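
To make the transposition in item 2 concrete, here is a minimal sketch. The class and method names are hypothetical stand-ins, not the actual Job.java code or its real signature; the sketch only reproduces the mapping described above (convergenceDelta read as t1, t1 as t2, t2 as convergenceDelta).

```java
public class ArgOrderSketch {
    // Hypothetical stand-in for the job entry point (assumed parameter order,
    // not Mahout's real signature). Returns what the job believes it was given.
    public static String runJob(double t1, double t2, double convergenceDelta) {
        return "t1=" + t1 + " t2=" + t2 + " convergenceDelta=" + convergenceDelta;
    }

    // Buggy caller: passes the values in a rotated order, so each one
    // lands in the wrong parameter slot.
    public static String buggyMain(double t1, double t2, double convergenceDelta) {
        return runJob(convergenceDelta, t1, t2);
    }

    // Fixed caller: argument order matches the parameter order.
    public static String fixedMain(double t1, double t2, double convergenceDelta) {
        return runJob(t1, t2, convergenceDelta);
    }

    public static void main(String[] args) {
        System.out.println(buggyMain(80.0, 55.0, 0.5));
        // prints "t1=0.5 t2=80.0 convergenceDelta=55.0"
        System.out.println(fixedMain(80.0, 55.0, 0.5));
        // prints "t1=80.0 t2=55.0 convergenceDelta=0.5"
    }
}
```

Because all three parameters share the same type, the compiler cannot catch this kind of mistake.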

As for the more serious issue #1 (kmeans clustering everything into one cluster), I found that this is because the CanopyClusteringJob was generating only one canopy. Digging deeper, I found that this problem was coming from the CanopyCombiner being run in both the map and reduce phases. From there I discovered this post from December 2008:


http://tinyurl.com/l83ff4

which indicates that from Hadoop 0.18 onwards the combiner will be run in both map and reduce, which is bad since the CanopyCombiner and KMeansCombiner assume that they are executed only on the map side. Now, the suggested workaround is specific to Hadoop 0.18, and it doesn't work with mahout-0.1 since mahout-0.1 requires Hadoop 0.19. This means a code fix is needed for this issue. In that thread, Grant talks about a patch (MAHOUT-99) that fixes the code, but that patch is already part of mahout-0.1, so it apparently does not fix the issue.
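
As a generic illustration of why a combiner that assumes a single map-side pass can break when the framework applies it more than once, consider the sketch below. It is not the CanopyCombiner or KMeansCombiner code, and the class and method names are hypothetical. It averages plain numbers: a combiner that emits a bare mean gives a wrong answer when its own output is combined again, while one that emits a (sum, count) partial can be re-applied any number of times.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSketch {
    // Unsafe under repeated application: collapses values to their mean
    // and throws away how many values went into it.
    public static double unsafeCombine(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    // Safe under repeated application: input and output are both
    // {sum, count} partials, so combining combined output is still correct.
    public static double[] safeCombine(List<double[]> partials) {
        double sum = 0;
        double count = 0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new double[] {sum, count};
    }

    public static void main(String[] args) {
        // True mean of 1, 2, 3, 10 is 16 / 4 = 4.0.
        double firstPass = unsafeCombine(Arrays.asList(1.0, 2.0, 3.0));
        double secondPass = unsafeCombine(Arrays.asList(firstPass, 10.0));
        System.out.println(secondPass); // prints 6.0, not 4.0

        double[] p1 = safeCombine(Arrays.asList(
                new double[] {1, 1}, new double[] {2, 1}, new double[] {3, 1}));
        double[] p2 = safeCombine(Arrays.asList(p1, new double[] {10, 1}));
        System.out.println(p2[0] / p2[1]); // prints 4.0
    }
}
```

The general rule this sketch relies on is that a combiner's output must be valid input to the same combiner, since Hadoop may run it zero, one, or several times.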

All that to say, I haven't been able to get the kmeans clustering example on syntheticdata to work, which is a bummer. My questions are:

1) Are there any open JIRAs on this issue (I didn't find any)? If not, should I create one?

2) Any workarounds for now?


Adil
