Re: BallKMeans: all points in a cluster are considered when updating the center

Ted Dunning Fri, 16 Nov 2012 08:12:54 -0800

It is forgotten.

I was experimenting with different trimFractions and ultimately wound up
not trimming in my experiments.

The problem here is that ball k-means gives pretty strong probabilistic
guarantees for well-separated clusters and good seeds, but only if each
center update includes just the points whose distance to their centroid is
much smaller than the distance to the nearest other centroid.  For smeared
data in high dimension, an aggressive trim can result in zero points inside
the ball.
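
To make the trimming concrete, here is a minimal sketch in plain Python, not
the Mahout code. It assumes Euclidean distance and at least two centroids, and
trimmed_update and trim_fraction are illustrative names. A point contributes
to its cluster's new center only if it lies within trim_fraction times the
distance from that centroid to its nearest other centroid.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def trimmed_update(points, centroids, trim_fraction):
    """One trimmed center update (illustrative sketch; needs >= 2 centroids).

    Returns (new_centroids, counts), where counts[i] is the number of
    points that fell inside the ball around centroid i.
    """
    k = len(centroids)
    dim = len(centroids[0])
    # Distance from each centroid to its nearest other centroid.
    nearest_other = [
        min(dist(c, o) for j, o in enumerate(centroids) if j != i)
        for i, c in enumerate(centroids)
    ]
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Assign the point to its closest centroid.
        i = min(range(k), key=lambda j: dist(p, centroids[j]))
        # Keep it only if it is inside the trimmed ball.
        if dist(p, centroids[i]) < trim_fraction * nearest_other[i]:
            counts[i] += 1
            for d, x in enumerate(p):
                sums[i][d] += x
    # An empty ball leaves the old centroid unchanged.
    new_centroids = [
        [s / counts[i] for s in sums[i]] if counts[i] else list(centroids[i])
        for i in range(k)
    ]
    return new_centroids, counts
```

With trim_fraction at 1.0 or above the ball covers most assigned points,
which is close to the untrimmed behaviour. With a very small trim_fraction
the counts can all come back as zero, which is the failure mode described
above.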

I didn't come up with a good answer.  My guess is that an adaptive scheme
might be useful where we start with an aggressive trim and relax it if we
don't get enough points.  How to do this robustly and still retain the
benefits of ball k-means is something I didn't have an answer for.  So I
punted and left the if in the code, but disabled it.
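
One possible reading of the adaptive idea, again as a hedged Python sketch
and not something that exists in the code: start with a small fraction and
widen the ball until it holds enough points. The function name, the
multiplicative relax step, and the min_points and max_fraction parameters are
all assumptions made for illustration, and nothing here shows that the ball
k-means guarantees survive the relaxation.

```python
def adaptive_ball(distances, nearest_other, start_fraction=0.25,
                  relax=1.5, min_points=3, max_fraction=1.0):
    """Pick a trim fraction for one cluster (illustrative sketch).

    distances: distance of each assigned point to the cluster centroid.
    nearest_other: distance from the centroid to its nearest other centroid.
    Assumes start_fraction > 0 and relax > 1 so the loop terminates.
    Returns (fraction_used, indices_of_points_inside_the_ball).
    """
    fraction = start_fraction
    while True:
        radius = fraction * nearest_other
        inside = [i for i, d in enumerate(distances) if d < radius]
        # Stop once the ball is populated enough or cannot grow further.
        if len(inside) >= min_points or fraction >= max_fraction:
            return fraction, inside
        fraction = min(fraction * relax, max_fraction)
```

For a tight cluster the starting fraction is kept as is. For smeared data the
fraction grows until it hits max_fraction, at which point the update may
still see fewer than min_points points.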

On Fri, Nov 16, 2012 at 4:47 AM, Dan Filimon <[email protected]> wrote:

> Hi,
>
> Ted, I'm testing the reducer and, looking more closely at the current
> BallKMeans code, I was surprised when I got to line 245 [1].
> That if statement is always true, so the centroids are calculated
> using every point assigned to their cluster.
>
> So, trimFraction is never really used.
> Is using all points intentional or is the "true" in that if statement
> forgotten?
>
> [1]
> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java#L245
>
