What actually is the significance of s0, s1 and s2? Apologies if it's a dumb question, but I can't find any comments in the code.
On Wed, Sep 26, 2012 at 7:19 PM, Jeff Eastman <[email protected]> wrote:

> You did not mention the heap size configured on your cluster. As you
> work on this problem, consider:
>
> - In all of the algorithms, all clusters are retained in memory by the
>   mappers and reducers
> - Each cluster holds 4 sparse vectors internally (center, radius, s1 and s2)
> - Vectors tend to become denser as iterations progress, due to the
>   summation of input vectors
> - FuzzyK is the worst offender, since it assigns every point to every
>   cluster with a weight in each iteration
> - Adjust T1 = T2 until you get a reasonable number of clusters using Canopy
> - Text problems usually generate very wide, sparse vectors, but the
>   clusters grow in size with iterations for the reason above
>
> On 9/26/12 2:57 AM, paritosh ranjan wrote:
>
>> On Wed, Sep 26, 2012 at 4:07 AM, Adair Kovac <[email protected]> wrote:
>>
>>> Hi folks, I'm running Mahout 0.7 and using the clustering command-line
>>> tools. The problem is that, so far, kmeans is the only one I can get to
>>> produce useful information on my data set and small (3-node) cluster.
>>>
>>> canopy either groups everything that isn't a starter point into one
>>> cluster or dies with GC out-of-memory errors. The "either" depends on
>>> my fiddling with t values and MAHOUT_HEAPSIZE.
>>
>> The values of t1 and t2 can also play a role here. You can adjust t2
>> upward, which will reduce the number of canopies produced and might
>> help get rid of the memory issues.
>>
>>> fkmeans throws Java heap space errors, even after I reduced my vector
>>> set to a whopping 24.0 MB (trying for 100 clusters).
>>
>> The fuzziness constraint might be too fuzzy. Try a stricter value and
>> loosen it step by step to find the breaking point.
>>
>>> clusterdump similarly curls up and dies (heap space errors) when I try
>>> to get it to dump all (or much more than 500 per cluster) of the
>>> clustered points at the end of my kmeans run.
>>
>> Try the clusterpp command instead; it does not have these memory
>> problems:
>> https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering
>>
>>> kmeans took over 10 hours to run on 228.3 MB of vectors, hitting the
>>> maximum of 10 iterations. (Right now I'm running it on a 969.0 MB
>>> vector file; hopefully it'll finish successfully.)
>>
>> If I understood correctly, the cluster currently has only 3 nodes;
>> maybe you can add more nodes to make it faster.
>> KMeans is by nature a multiple-iteration algorithm. One thing you can
>> do is find Canopies first and then run fewer KMeans iterations; if the
>> initial clusters are good, the quality will be good too, and the total
>> execution time can drop significantly.
>>
>>> I'm using small text documents, so the number + sparsity might be the
>>> problem.
>>
>> Yes, it might be.
>>
>>> Are these issues unusual? Any advice on resolving them? Most of the
>>> Google hits for similar issues just suggest setting MAHOUT_HEAPSIZE
>>> to 2048.
>>
>> Some tuning always helps to run it properly.

--
Regards,
Rahul K Mishra
www.ee.iitb.ac.in/student/~rahulkmishra
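Regarding the s0/s1/s2 question at the top of this message: going by the AbstractCluster code in Mahout 0.7 (take the names and formulas below as one reading of the source; check your own version), they are the running statistics each cluster accumulates over the points it observes:

    s0 = n               (count of observed points; a scalar, which is why it
                          is missing from Jeff's list of four vectors)
    s1 = sum of x_i      (component-wise sum of the observed points)
    s2 = sum of x_i^2    (component-wise sum of squares)

    center = s1 / s0
    radius = sqrt(s0 * s2 - s1^2) / s0    (per-component standard deviation)

Because s1 and s2 accumulate sums over many sparse points, they fill in more and more components with each iteration, which is exactly the densification Jeff describes.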
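One detail worth spelling out about Jeff's first point: MAHOUT_HEAPSIZE only sizes the client-side JVM that the bin/mahout script launches. The mappers and reducers that actually hold the clusters get their heap from Hadoop, e.g. via mapred.child.java.opts. Assuming the drivers on your build accept Hadoop's generic -D options (the 0.7 drivers run through ToolRunner, so they should, but verify), and with placeholder paths:

    $ export MAHOUT_HEAPSIZE=2048                      # client JVM only, in MB
    $ mahout kmeans -Dmapred.child.java.opts=-Xmx2048m \
        -i tfidf-vectors -c initial-clusters -o kmeans-output \
        -x 10 -ow -cl

Otherwise, set the property cluster-wide in mapred-site.xml.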
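To make the t1/t2 advice concrete, here is a sketch of a Canopy run; the paths, the distance measure and the 0.5 thresholds are illustrative placeholders, not recommendations:

    $ mahout canopy \
        -i tfidf-vectors \
        -o canopy-output \
        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
        -t1 0.5 \
        -t2 0.5 \
        -ow

Per Jeff's suggestion, start with t1 = t2 and adjust until the canopy count looks reasonable; per paritosh's, raising t2 reduces the number of canopies, and with them the memory each mapper has to hold.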
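On seeding KMeans from Canopy: pointing the kmeans -c option at the canopy output and omitting -k makes the canopy centroids the initial clusters, so a small iteration cap (-x) may already give good results. A sketch, assuming the clusters-0-final directory name that 0.7's Canopy writes (verify against your own output):

    $ mahout kmeans \
        -i tfidf-vectors \
        -c canopy-output/clusters-0-final \
        -o kmeans-output \
        -x 3 \
        -cd 0.01 \
        -ow -cl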
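On the fuzziness constraint: fkmeans exposes it as the -m option, which must be greater than 1. Values close to 1 behave almost like hard k-means, while larger values spread every point's weight across more clusters, which feeds the per-iteration blow-up Jeff warns about. A sketch with placeholder paths, starting strict as paritosh suggests:

    $ mahout fkmeans \
        -i tfidf-vectors \
        -c initial-clusters \
        -o fkmeans-output \
        -k 100 \
        -m 1.1 \
        -x 10 \
        -ow -cl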
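And on clusterpp: a minimal sketch based on the Top Down Clustering wiki page linked above; option names may vary across builds, so confirm with mahout clusterpp --help:

    $ mahout clusterpp \
        -i kmeans-output \
        -o clustered-points

It writes the clustered points out into one directory per cluster, which avoids loading everything into a single clusterdump process.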
