Is mahout kmeans slow ?

2012-09-12 Thread Elaine Gan
Hi, I'm trying to do some text analysis using Mahout kmeans (clustering), processing the data on Hadoop. --numClusters = 160 --maxIter (-x) maxIter = 200 Well, my data is small, around 500 MB. I have 4 servers, each with 4 CPUs, and TaskTrackers are set to 4 as maximum. When I run the Mahout task, I

Re: Is mahout kmeans slow ?

2012-09-12 Thread Ted Dunning
Yes. I have been working (slowly) on moving some very fast single pass clustering into Mahout. My work in progress currently does very fast clustering of small dense vectors and it should scale to sparse vectors fairly well with some small changes. See https://github.com/tdunning/knn for more in

Re: Is mahout kmeans slow ?

2012-09-12 Thread Pat Ferrel
200 iterations? What is your convergence delta? If it is too small for your distance measure you will perform all 200 iterations, every time you cluster. --convergenceDelta (-cd) convergenceDelta The convergence delta value. Default is 0.5 I wo
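To make the interaction between the iteration cap and the convergence delta concrete, here is a minimal Python sketch of a k-means loop that stops early once no center moves farther than the delta. It is an illustration only, not Mahout's implementation: the function names, the Euclidean distance, and the "largest center shift <= delta" stopping rule are assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, centers, max_iter, convergence_delta, distance=euclidean):
    """Hypothetical helper: returns (centers, iterations_run).

    Stops as soon as the largest center movement in one iteration is
    <= convergence_delta; otherwise runs all max_iter iterations.
    """
    centers = [list(c) for c in centers]
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: distance(p, centers[i]))
            groups[nearest].append(p)

        # Update step: move each center to the mean of its members.
        new_centers = []
        for old, members in zip(centers, groups):
            if members:
                new_centers.append(
                    [sum(m[d] for m in members) / len(members)
                     for d in range(len(old))])
            else:
                new_centers.append(old)  # keep an empty cluster in place

        largest_shift = max(distance(o, n)
                            for o, n in zip(centers, new_centers))
        centers = new_centers
        if largest_shift <= convergence_delta:
            return centers, iteration  # converged before the cap
    return centers, max_iter  # delta never satisfied: cap reached


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
start = [(0.0, 0.0), (1.0, 0.0)]
for delta in (100.0, 0.5, 0.0):
    _, iterations = kmeans(points, start, max_iter=200, convergence_delta=delta)
    print(delta, iterations)
```

Because the delta is compared against a distance, a sensible value depends on the scale of the chosen distance measure: a value that is loose for one measure can be unreachable for another, in which case every run uses the full iteration cap.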

Re: Is mahout kmeans slow ?

2012-09-12 Thread Ted Dunning
Also, with 500MB of data, this is likely to only take a few minutes on a single machine with the new clustering stuff. It is hard to estimate precisely, however, due to the difference between dense and sparse cases. On Wed, Sep 12, 2012 at 8:42 PM, Pat Ferrel wrote: > 200 iterations? > > What i

Re: Is mahout kmeans slow ?

2012-09-12 Thread Elaine Gan
My -cd was quite loose, set it at 0.1. Hmm, maybe the data is too small, causing the low performance? > 200 iterations? > > What is your convergence delta? If it is too small for your distance measure > you will perform all 200 iterations, every time you cluster. > > --convergenceDelta (

Re: Is mahout kmeans slow ?

2012-09-12 Thread Paritosh Ranjan
You can also try to find initial clusters first using canopy clustering; it's a fast single-iteration clustering algorithm. https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering Canopy clustering would provide you with better initial clusters which you can feed into kmeans for faster c
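As a rough illustration of the canopy idea, the Python sketch below makes a single pass over the points with a loose threshold t1 and a tight threshold t2 (t1 > t2) and returns one centroid per canopy, which could then serve as starting centers for k-means. The function name, the thresholds, and the use of Euclidean distance are assumptions for this example; it is not Mahout's code.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t1, t2, distance=euclidean):
    """Hypothetical helper: one-pass canopy clustering.

    Points within t1 of a canopy seed join that canopy; points within
    t2 are additionally removed from the pool of future seeds.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)  # next unclaimed point starts a canopy
        members = [seed]
        still_remaining = []
        for p in remaining:
            d = distance(seed, p)
            if d < t1:
                members.append(p)          # loosely belongs to this canopy
            if d >= t2:
                still_remaining.append(p)  # may still seed or join another
        remaining = still_remaining
        # Use the canopy's mean as its representative center.
        centers.append(tuple(sum(m[i] for m in members) / len(members)
                             for i in range(len(seed))))
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(canopy_centers(points, t1=3.0, t2=2.0))
```

The number of canopies, and so the number of initial clusters, falls out of the thresholds rather than being fixed in advance, so t1 and t2 need to match the scale of the distance measure.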

Re: Is mahout kmeans slow ?

2012-09-13 Thread Pat Ferrel
What distance measure? On Sep 12, 2012, at 10:37 PM, Elaine Gan wrote: My -cd was quite loose, set it at 0.1. Hmm, maybe the data is too small, causing the low performance? > 200 iterations? > > What is your convergence delta? If it is too small for your distance measure > you will perfor

Re: Is mahout kmeans slow ?

2012-09-13 Thread Pat Ferrel
Actually, if it is really taking 200 iterations then it is never matching your convergence delta. That means either your data does not cluster well or your convergence delta is still too tight. I was suggesting that you loosen the convergence delta until it only takes 10-20 iterations to cluster t