On 9/12/12 10:42 PM, Ted Dunning wrote:
-user@
+dev@

Well, this is not really an apples-to-apples comparison, is it? Running any Hadoop job through 200 iterations is unlikely to ever take less than 200 minutes because of Hadoop's setup-teardown overhead. And, while a sequential, in-memory clustering algorithm may produce excellent results on a single machine even over large data sets, it isn't k-means. K-means requires every point to be tested against every cluster during every iteration. So saying that Mahout k-means is "slow" as a general statement kinda bothers me because it implies a comparison to a different Hadoop implementation that AFAICT has not been done.
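To make the cost argument concrete, here is a minimal in-memory sketch of the textbook k-means loop in Python. It is not Mahout's implementation; the function names (assign, update, kmeans) and the max_iterations parameter are made up for illustration. The point is only that the assignment step tests every point against every centroid on every iteration.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        # Every point is compared against every centroid.
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
        else:
            new_centroids.append(
                tuple(sum(coords) / len(members) for coords in zip(*members))
            )
    return new_centroids

def kmeans(points, centroids, max_iterations=200):
    """Repeat assign/update until the centroids stop moving or the cap is hit."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)
```

With n points, k clusters and i iterations this does n * k * i distance computations. In a Hadoop setting each pass through the loop would be its own job, so the per-job setup-teardown cost is paid once per iteration on top of that.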

But maybe I'm just being too sensitive about all the work that has gone into making Mahout k-means as good as it is...
Yes.

I have been working (slowly) on moving some very fast single-pass
clustering into Mahout.  My work in progress currently does very fast
clustering of small dense vectors and it should scale to sparse vectors
fairly well with some small changes.

See https://github.com/tdunning/knn for more info.
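For contrast with iterative k-means, a single-pass scheme looks at each point exactly once. The Python sketch below is only a generic illustration under assumed names (single_pass_cluster, threshold); it is not claimed to be the algorithm in the repository linked above. Each point either joins its nearest centroid, if that centroid is within the threshold, or starts a new cluster.

```python
import math

def single_pass_cluster(points, threshold):
    """Cluster points in one pass using a hypothetical distance threshold."""
    centroids = []  # running mean of each cluster
    counts = []     # number of points absorbed by each cluster
    for p in points:
        # Find the nearest existing centroid, if any.
        best, best_distance = None, float("inf")
        for j, c in enumerate(centroids):
            d = math.dist(p, c)
            if d < best_distance:
                best, best_distance = j, d
        if best is None or best_distance > threshold:
            # Too far from everything seen so far: start a new cluster.
            centroids.append(list(p))
            counts.append(1)
        else:
            # Fold the point into the running mean of the nearest cluster.
            counts[best] += 1
            n = counts[best]
            centroids[best] = [
                ci + (pi - ci) / n for ci, pi in zip(centroids[best], p)
            ]
    return centroids, counts
```

There is no outer iteration loop here, so the data is read once. The trade-off is that the result depends on the order of the points and on the choice of threshold.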

On Wed, Sep 12, 2012 at 7:26 PM, Elaine Gan <[email protected]> wrote:

Are there any ways to improve Mahout k-means to speed it up?

