You did not mention the heap size configured on your cluster. As you work on this problem, consider:

 * In all of the algorithms, all clusters are retained in memory by the
   mappers and reducers
 * Each cluster holds 4 sparse vectors internally (center, radius, s1 & s2)
 * Vectors tend to become more dense as iterations progress due to
   summation of input vectors
 * FuzzyK is the worst offender since it assigns every point to every
   cluster, with a weight, in each iteration
 * With Canopy, set T1=T2 and adjust the value until you get a reasonable
   number of clusters
 * Text problems usually generate very wide, sparse vectors, but the
   cluster vectors become denser with each iteration for the reason above
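
To make the points above concrete, here is a rough back-of-the-envelope sketch in Python. It is not Mahout code: the function name, the 12 bytes per non-zero entry (a 4-byte index plus an 8-byte double), and the example sizes are all assumptions chosen for illustration, and real JVM object overhead will push the numbers higher.

```python
def estimate_cluster_bytes(num_clusters, nonzeros_per_vector,
                           vectors_per_cluster=4, bytes_per_entry=12):
    """Rough lower bound on the memory held by the cluster models.

    vectors_per_cluster=4 follows the list above (center, radius, s1, s2).
    bytes_per_entry=12 is an assumption: 4-byte int index + 8-byte double.
    """
    return (num_clusters * vectors_per_cluster
            * nonzeros_per_vector * bytes_per_entry)


if __name__ == "__main__":
    # Hypothetical numbers: 100 clusters over a 500,000-term dictionary.
    # Early on, cluster vectors are about as sparse as the documents.
    sparse = estimate_cluster_bytes(100, nonzeros_per_vector=200)
    # After summing many documents, a cluster vector can approach full width.
    dense = estimate_cluster_bytes(100, nonzeros_per_vector=500_000)
    print(f"early iterations (sparse): {sparse / 2**20:.1f} MiB")
    print(f"late iterations (dense):   {dense / 2**20:.1f} MiB")
```

The estimate scales with the number of clusters times the number of non-zero entries per cluster vector, not with the size of the input file, which is why a 24 MB input can still exhaust the heap.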


On 9/26/12 2:57 AM, paritosh ranjan wrote:
On Wed, Sep 26, 2012 at 4:07 AM, Adair Kovac <[email protected]> wrote:

Hi folks, I'm running Mahout 0.7 and using the clustering commandline
tools. Problem is, the only one I can get to supply useful information on
my data set and small (3-node) cluster is kmeans, so far.

canopy either groups everything that isn't a starter-point into one cluster
or gets GC out of memory errors. The "either" is based on my fiddling with
t values and MAHOUT_HEAPSIZE.

The values of t1 and t2 can also play a role here. Adjusting t2 upward
reduces the number of canopies produced, which might help with the memory
issues.
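
A minimal Python sketch of a single-pass canopy builder shows why a larger t2 yields fewer canopies. This is a simplified illustration under stated assumptions, not Mahout's implementation: the function names and the 1-D sample points are hypothetical, and Euclidean distance is assumed.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_canopies(points, t1, t2, distance=euclidean):
    """Single-pass canopy sketch: returns a list of (center, members).

    A point within t1 of a center joins that canopy. A point within t2 of
    any center is strongly bound and cannot start a new canopy.
    """
    if t1 < t2:
        raise ValueError("t1 must be >= t2")
    canopies = []
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = distance(p, center)
            if d < t1:
                members.append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            # No existing center is close enough, so p seeds a new canopy.
            canopies.append((p, [p]))
    return canopies


if __name__ == "__main__":
    points = [(float(i),) for i in range(10)]  # hypothetical 1-D data
    for t2 in (1.5, 3.5):
        found = build_canopies(points, t1=t2 + 1.5, t2=t2)
        print(f"t2={t2}: {len(found)} canopies")
```

Only t2 controls how many canopies are created; t1 controls how many points each canopy collects.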


fkmeans throws Java heap space errors, even after I reduced my vectors set
to a whopping 24.0 MB (trying for 100 clusters).

The fuzziness setting might be too loose. You can start with a stricter
value and loosen it step by step to find the breaking point.
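
As a hedged illustration of what "stricter" means, the sketch below computes membership weights using the textbook fuzzy c-means formula, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)). That Mahout's fkmeans follows this exact formulation is an assumption here, and the function name and sample distances are hypothetical.

```python
def memberships(distances, m):
    """Textbook fuzzy c-means memberships of one point.

    distances: non-zero distances from the point to each cluster center.
    m: fuzziness parameter, must be > 1; values near 1 are the strictest.
    """
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    exponent = 2.0 / (m - 1.0)
    weights = []
    for d_i in distances:
        total = sum((d_i / d_k) ** exponent for d_k in distances)
        weights.append(1.0 / total)
    return weights


if __name__ == "__main__":
    distances = [1.0, 2.0, 4.0]  # hypothetical distances to three centers
    for m in (1.1, 2.0, 5.0):
        rounded = [round(w, 4) for w in memberships(distances, m)]
        print(f"m={m}: {rounded}")
```

With m close to 1, almost all of the weight goes to the nearest cluster; as m grows, the weights spread out across all clusters.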


clusterdump similarly curls up and dies (heap space errors) when I try to
get it to dump all (or much more than 500 per cluster) of the clustered
points at the end of my kmeans algorithm.

Try the clusterpp command instead; it does not have these memory problems.
https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering


kmeans took over 10 hours to run on 228.3MB of vectors, hitting the max
iterations of 10. (Right now I'm running it on a 969.0MB vector file,
hopefully it'll finish successfully.)


If I understood correctly, the cluster currently has only 3 nodes; adding
more nodes may speed things up.
KMeans is by nature a multiple-iteration algorithm. One option is to find
canopies first and then run fewer KMeans iterations: if the initial
clusters are good, the result quality will still be good, and this can
significantly reduce the total run time.



I'm using small text documents, so the number + sparsity might be the
problem.


Yes, might be.


Are these issues unusual? Any advice on resolving them? Most of the google
hits for similar issues just suggest setting MAHOUT_HEAPSIZE to 2048.

Some tuning is always needed to get these jobs running properly.

