Hi folks,
I am new to Mahout and started exploring the Mahout 0.1 release by
trying to run the k-means clustering example as described in
http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html
After a bunch of runs where, no matter what parameters I specified, the
output never changed, I realized that:
1. KMeans was clustering all 600 points of syntheticcontroldata into one
cluster.
2. There is a bug in
examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
where main calls runJob with the arguments I provided transposed: my
convergenceDelta was interpreted as t1, t1 as t2, and t2 as
convergenceDelta. I will commit a patch as soon as I get approval for
open-source commits from my employer. In the meantime, I thought I'd put
it out there in case someone else is running into the same issue.
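
To illustrate the transposition described in item 2, here is a minimal
sketch. It is not the actual Job.java code: the class name, the
three-argument runJob signature, and the numeric values are all
hypothetical, chosen only to show how passing the same doubles in the
wrong order silently changes their meaning.

```java
// Hypothetical sketch; names, signature, and values are illustrative
// and are NOT taken from Mahout's Job.java.
final class ArgOrderSketch {

    // Stand-in for the job entry point: reports how it interpreted its arguments.
    static String runJob(double t1, double t2, double convergenceDelta) {
        return "t1=" + t1 + " t2=" + t2 + " convergenceDelta=" + convergenceDelta;
    }

    // Buggy caller: passes the values in the order they were parsed,
    // so convergenceDelta lands in t1, t1 in t2, and t2 in convergenceDelta.
    static String buggyCall(double convergenceDelta, double t1, double t2) {
        return runJob(convergenceDelta, t1, t2);
    }

    // Fixed caller: reorders the values to match the callee's parameter order.
    static String fixedCall(double convergenceDelta, double t1, double t2) {
        return runJob(t1, t2, convergenceDelta);
    }

    public static void main(String[] args) {
        System.out.println(buggyCall(0.5, 80.0, 55.0));
        System.out.println(fixedCall(0.5, 80.0, 55.0));
    }
}
```

Because all three parameters are doubles, the compiler cannot catch the
mix-up; the job simply runs with the wrong thresholds.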
As for the more serious issue #1 (k-means clustering everything into one
cluster), I found that this is because the CanopyClusteringJob was
generating only one canopy. Digging deeper, I found that this problem
comes from the CanopyCombiner being run in both the map and reduce
phases. From there I discovered this post from December 2008:
http://tinyurl.com/l83ff4
which indicates that from Hadoop 0.18 onwards the combiner may be run
on both the map and the reduce side. That is a problem because the
CanopyCombiner and KMeansCombiner assume that they are executed only on
the map side. The suggested workaround is specific to Hadoop 0.18, and
it doesn't work with mahout-0.1, since mahout-0.1 requires Hadoop 0.19.
This means a code fix is needed for this issue. In the thread, Grant
mentions a patch (MAHOUT-99) that fixes the code, but that patch is
already part of mahout-0.1, so apparently it does not fix the issue.
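
To show the general failure mode I mean (this is not the Mahout
combiner code, just a plain-Java sketch with hypothetical names and
made-up 1-D points): a combiner that assumes every input is a raw point
gives a wrong mean as soon as it is applied a second time to its own
output, while one that carries the point count along can safely be
applied any number of times.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a combiner that is only correct when run once,
// next to one that tolerates repeated application. No Hadoop or Mahout
// classes are used; all names here are illustrative.
final class CombinerSketch {

    // Partial aggregate: a running sum of 1-D points and how many points it covers.
    static final class Partial {
        final double sum;
        final long count;
        Partial(double sum, long count) { this.sum = sum; this.count = count; }
    }

    // Unsafe: treats each input as a single raw point, ignoring its carried count.
    static Partial unsafeCombine(List<Partial> values) {
        double sum = 0;
        for (Partial p : values) sum += p.sum;
        return new Partial(sum, values.size());
    }

    // Safe: adds up the carried counts, so the operation is associative.
    static Partial safeCombine(List<Partial> values) {
        double sum = 0;
        long count = 0;
        for (Partial p : values) { sum += p.sum; count += p.count; }
        return new Partial(sum, count);
    }

    static double mean(Partial p) { return p.sum / p.count; }

    public static void main(String[] args) {
        // Raw points 1, 2, 3 from one mapper and 10 from another; true mean is 4.0.
        List<Partial> a = Arrays.asList(new Partial(1, 1), new Partial(2, 1), new Partial(3, 1));
        List<Partial> b = Arrays.asList(new Partial(10, 1));

        // Simulate the combiner running once per mapper and then again on the merged output.
        Partial unsafe = unsafeCombine(Arrays.asList(unsafeCombine(a), unsafeCombine(b)));
        Partial safe = safeCombine(Arrays.asList(safeCombine(a), safeCombine(b)));

        System.out.println(mean(unsafe)); // 8.0, wrong: the second pass counted 2 "points"
        System.out.println(mean(safe));   // 4.0, matches the true mean
    }
}
```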
All that to say, I haven't been able to get the k-means clustering
example on syntheticdata to work, which is a bummer. My questions are:
1) Are there any open JIRA issues for this (I didn't find any)? If not,
should I create one?
2) Any workarounds for now?
Adil