On Thu, Sep 13, 2012 at 9:06 AM, Jeff Eastman <[email protected]> wrote:

> On 9/12/12 10:42 PM, Ted Dunning wrote:
>
> -user@
> +dev@
>
> Well, this is not really an apples-to-apples comparison is it? Running any
> Hadoop job through 200 iterations is unlikely to ever take less than 200
> minutes because of Hadoop's setup-teardown overhead.

Yes. That is exactly the point. A one-pass algorithm is a very good thing
in the Hadoop environment.

I didn't say, but should point out, that this algorithm is easily adapted
to map-reduce and still uses a single pass, with speeds linearly faster
than the single-machine case.

> And, while a sequential, in-memory clustering algorithm may produce
> excellent results on a single machine even over large data sets, it isn't
> k-means. K-means requires every point to be tested against every cluster
> during every iteration.

Actually, it is. K-means is a general framework, not a single algorithm.
There are convergence bounds that show that this single-pass algorithm
gives a good estimate of the optimal answer for well-clusterable data.

> So saying that Mahout k-means is "slow" as a general statement kinda
> bothers me because it implies a comparison to a different, Hadoop
> implementation that AFAICT has not been done.
>

This is a very fair point.

> But maybe I'm just being too sensitive about all the work that has gone
> into making Mahout k-means as good as it is...
>

That has been a huge amount of work.

And the limitations of the new stuff should also be mentioned ... in
particular, it only supports L_2 distance (and really cannot be easily
extended).
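
To make the one-pass idea concrete, here is a minimal sketch of a generic
sequential clustering pass. It is an illustration only: it is not the
algorithm discussed above and not Mahout code. The function name
single_pass_cluster and the threshold parameter are hypothetical. Each point
is compared against the current centroids exactly once, and the running-mean
update is the step that ties a sketch like this to L_2 distance, since the
mean is the minimizer of squared Euclidean distance.

```python
import math


def single_pass_cluster(points, threshold):
    """Cluster points in one sequential pass (hypothetical sketch).

    A point joins its nearest centroid if it lies within `threshold`
    (Euclidean distance); otherwise it starts a new cluster.
    """
    centroids = []  # one coordinate list per cluster
    counts = []     # number of points absorbed by each cluster

    for p in points:
        # Find the nearest existing centroid under L_2 distance.
        best, best_d = -1, math.inf
        for i, c in enumerate(centroids):
            d = math.dist(p, c)
            if d < best_d:
                best, best_d = i, d

        if best < 0 or best_d > threshold:
            # No centroid is close enough: open a new cluster at this point.
            centroids.append([float(x) for x in p])
            counts.append(1)
        else:
            # Running-mean update: move the centroid toward the new point.
            counts[best] += 1
            n = counts[best]
            c = centroids[best]
            for j in range(len(c)):
                c[j] += (p[j] - c[j]) / n

    return centroids, counts


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(single_pass_cluster(pts, threshold=5.0))
```

Unlike an iterative k-means run, the data is read once, so a Hadoop version
of something like this would pay the job setup-teardown cost once rather
than once per iteration. The result does depend on point order and on the
chosen threshold.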
