On Thu, Sep 13, 2012 at 9:06 AM, Jeff Eastman <[email protected]> wrote:

> On 9/12/12 10:42 PM, Ted Dunning wrote:
> -user@
> +dev@
>
> Well, this is not really an apples-to-apples comparison is it? Running any
> Hadoop job through 200 iterations is unlikely to ever take less than 200
> minutes because of Hadoop's setup-teardown overhead.


Yes.  That is exactly the point.  A one-pass algorithm is a very good thing
in the Hadoop environment.

I didn't say, but should point out, that this algorithm is easily adapted
to map-reduce and still uses a single pass, with a speedup over the
single-machine case that is linear in the number of machines.

> And, while a sequential, in-memory clustering algorithm may produce
> excellent results on a single machine even over large data sets, it isn't
> k-means.  K-means requires every point to be tested against every cluster
> during every iteration.


Actually, it is.  K-means is a general framework, not a single algorithm.
There are convergence bounds that show that this single-pass algorithm is
a good estimate of the optimal answer for well-clusterable data.
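
To make the contrast with the iterative approach concrete, here is a
minimal sketch of what a single pass can look like. It is not the
algorithm discussed in this thread and it is not Mahout code: the function
name single_pass_cluster and the fixed distance threshold are assumptions
made purely for illustration, and real streaming variants typically decide
when to open a new centroid probabilistically rather than with a hard
cutoff. Each point is compared against the current centroids exactly once,
using L_2 distance.

```python
import math


def single_pass_cluster(points, threshold):
    """Visit each point once: join the nearest centroid or start a new one.

    Hypothetical helper for illustration only. Returns a list of
    [mean_vector, weight] pairs, one per centroid.
    """
    centroids = []
    for p in points:
        # Find the nearest existing centroid under L_2 (Euclidean) distance.
        best_i, best_d = -1, math.inf
        for i, (c, _) in enumerate(centroids):
            d = math.dist(p, c)
            if d < best_d:
                best_i, best_d = i, d

        if best_i < 0 or best_d > threshold:
            # No centroid is close enough, so this point seeds a new one.
            centroids.append([list(p), 1])
        else:
            # Fold the point into the nearest centroid as a running mean.
            c, w = centroids[best_i]
            mean = [(ci * w + pi) / (w + 1) for ci, pi in zip(c, p)]
            centroids[best_i] = [mean, w + 1]
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    for mean, weight in single_pass_cluster(data, threshold=5.0):
        print(mean, weight)
```

The weighted centroids that come out of a pass like this are much smaller
than the input, so they can be clustered again in memory afterwards. The
running-mean update is also where the dependence on L_2 distance shows up,
since the mean is the minimizer only for squared Euclidean error.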


> So saying that Mahout k-means is "slow" as a general statement kinda
> bothers me because it implies a comparison to a different, Hadoop
> implementation that AFAICT has not been done.
>

This is a very fair point.


> But maybe I'm just being too sensitive about all the work that has gone
> into making Mahout k-means as good as it is...
>

That has been a huge amount of work.

And the limitations of the new stuff should also be mentioned ... in
particular, it only supports L_2 distance (and cannot easily be extended
to other distance measures).
