On Wed, May 8, 2013 at 7:48 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Inline
>
>
> On Wed, May 8, 2013 at 8:49 AM, Andy Twigg <andy.tw...@gmail.com> wrote:
>
> > both of those make sense to me.
> >
> >
> >
> > On 8 May 2013 16:45, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> >
> >> Hi Ted!
> >>
> >> I recently talked to one of the authors of streaming k-means, Adam
> >> Meyerson, asking about the distance cutoff, as I wasn't sure of the
> >> right value for it.
> >>
> >> He told me two things:
> >> - that we should multiply the distance / distanceCutoff ratio by the
> >> weight of the point we're trying to cluster, so as to avoid collapsing
> >> larger clusters
> >>
> >
> This makes sense, but I have no idea what the effect will be.
>

I think it removes the need for the special-case handling where we only
increase distanceCutoff by beta inside a separate if.

We can simply multiply it by beta every time with the same results. The
comment said as much:

        // In the original algorithm, with distributions with sharp scale effects,
        // the distanceCutoff can grow to an excessive size, leading sub-clustering
        // to collapse the centroids set too much. This test prevents an increase
        // in distanceCutoff if the current value is doing well at collapsing the
        // clusters.

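Concretely, the simplification would look something like this (just a sketch
with simplified types and made-up names, not the actual patch):

    import java.util.Random;

    class WeightedSpawnSketch {
      // Decide whether a weighted point should found a new centroid rather
      // than be merged into its nearest one. 'weight' is the number of
      // original points this (possibly already merged) point stands for;
      // 'distance' is the distance to the closest existing centroid.
      static boolean spawnsNewCentroid(double weight, double distance,
                                       double distanceCutoff, Random random) {
        // Adam's suggestion: weight the ratio so that heavy points (i.e.,
        // large collapsed clusters) are more likely to keep their own
        // centroid instead of being folded into a neighbor.
        double p = Math.min(1.0, weight * distance / distanceCutoff);
        return random.nextDouble() < p;
      }
    }

With the weighted test in place, distanceCutoff can simply be multiplied by
beta on every collapse, with no guarding if.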

> >> - the initial cutoff they use is basically 1 / numClusters
> >>
> >
> This is just wrong.
>
> It is fine in theoretical settings with synthetic clusters of unit
> characteristic scale, but it is not invariant under uniform scaling, so it
> can't be correct.


But the thing is, it's not really a distance at all. I've seen it grow far
beyond any plausible maximum distance between clusters.
For example, on a data set where the maximum distance was 2, it went all the
way up to 14. :)

It's only conceptually a distance; in practice it acts more like a "factor".
They actually call it a "facility cost" rather than a distance, probably for
this reason: it only enters the algorithm through the ratio distance /
distanceCutoff, which is used as a probability, so it has no direct
geometric meaning.


> >
> >> As I tested the code on multiple well-known data sets, this got me
> >> thinking of removing the distanceCutoff altogether.
> >> It seems like just another parameter to get right, with only limited
> >> real value in fiddling with it.
> >>
> >
> I think it is core to the algorithm.  It is adapted to the data in any
> case.
>
>
> >> Additionally, clusterOvershoot, the thing we're using to delay
> >> distanceCutoff increases, also seems somewhat unnecessary. Why use it
> >> and get a lot more centroids than we asked for?
> >>
> >
> Well, it needs to be there until we get more run-time.  Getting more
> centroids doesn't hurt and getting too few definitely does hurt.  Thus we
> need some insurance.


Well, it would get as many clusters as it needs anyway. We increase
numClusters as we go, up to clusterLogFactor * log(numProcessedDatapoints).
I understand the point that we should get more rather than fewer centroids,
but k log n (which is what we end up with) seems fine.
The initial estimate we give it only matters until enough data points have
been processed to go over k log n.
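
For reference, the growth rule I mean is roughly this (a sketch with
approximate names, not the code verbatim):

    class NumClustersSketch {
      // numClusters never shrinks; it grows toward k log n, where n is the
      // number of data points processed so far.
      static int updateNumClusters(int numClusters, long numProcessedDatapoints,
                                   double clusterLogFactor) {
        return (int) Math.max(numClusters,
            clusterLogFactor * Math.log(numProcessedDatapoints));
      }
    }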

>
> >> I want to post a final version for review, but I just wanted to mention
> >> these two things.
> >>
> >> It's not like they really "hurt"; they just don't seem to help much,
> >> and I'd rather have something that more closely matches the
> >> theoretical guarantees in the paper.
> >>
> >> What do you think?
> >>
> >
> I think we already do better than some of the guarantees in the paper.
>

I think we do about the same, and I'm wondering whether the extra knobs
actually do anything. I tried fiddling with them a bit and they didn't seem
to do much.
I also haven't worked through the math for our particular version.
