On Wed, May 8, 2013 at 10:00 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> On Wed, May 8, 2013 at 7:48 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Inline
> >
> >> He told me two things:
> >> - that we should multiply the distance / distanceCutoff ratio by the
> >> weight of the point we're trying to cluster so as to avoid collapsing
> >> larger clusters
> >
> > This makes sense, but I have no idea what the effect will be.
>
> I think it avoids the need for the special way we handle the increase of
> distanceCutoff by beta in another if.

Sure. Sounds right and all. But experiment will tell better.

> >> - the initial cutoff they use is 1 / numClusters basically
> >
> > This is just wrong.
> >
> > It is fine in theoretical settings with synthetic clusters of unit
> > characteristic scale, but is not invariant over uniform scaling so it
> > can't be correct.
>
> But the thing is, it's not really a distance at all. I've seen it increase
> to far beyond what a maximum distance between clusters should be.
> So, for example, in a case where the maximum distance is 2, it went all
> the way up to 14. :)
>
> It's only a distance conceptually; it's more like a "factor". They
> actually call it a "facility cost" rather than a distance, probably for
> this reason.

It has units of distance and thus should be invariant with respect to
uniform rescaling. No way around that. That means that it has to be based
on a measurement from the data.

And since it is a probabilistic bound, it is OK if it exceeds other
distances. 14 versus 2 does seem a bit high, but conceptually exceeding
the distance is not a problem. I think that the 14 versus 2 problem is
something else.

> >> Additionally, clusterOvershoot, the thing we're using to delay
> >> distanceCutoff increases, also seems somewhat unnecessary. Why use it
> >> and get a lot more centroids than what we asked for?
> >
> > Well, it needs to be there until we get more run-time. Getting more
> > centroids doesn't hurt and getting too few definitely does hurt. Thus
> > we need some insurance.
>
> Well, it would get as many clusters as it needs to. And we increase
> numClusters as we go until it's clusterLogFactor * log(numProcessedDatapoints).
> I understand the point that we should get more rather than fewer
> centroids, but k log n (what we're getting at the end) seems fine.

The key is not the number at the end. It is that the number never drops
below enough centroids.

> The estimate we give it at the beginning is only valid as long as not
> enough data points have been processed to go over k log n.

Are we talking about clusterOvershoot here? Or the numClusters override?

> > I think we already do better than some of the guarantees in the paper.
>
> I think we do the same, and I'm wondering whether the extra knobs do
> anything or not. I tried fiddling with them a bit and they didn't seem
> to do much.
>
> I also haven't worked through the math for our particular version.

Well, we have seen cases where the overshoot needed to be > 1. Those may
have gone away with better adaptation, but I think that they probably
still can happen.

The distanceCutoff initialization is blatantly important if you try
scaling the test data by 1e-9. Without adapting to the data, you will get
exactly one cluster.
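
To pin down the first point, here is a rough sketch of where the weight
multiplier would go. This is not the StreamingKMeans code, just a toy
version (plain double arrays, Euclidean distance, a simplified collapse
test) to make the proposal concrete:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy sketch of one streaming pass, NOT the real StreamingKMeans code.
// It only shows where the weight multiplier from the first point would go.
class WeightedStreamingSketch {
  private final Random rand = new Random();
  private final List<double[]> centers = new ArrayList<double[]>();
  private final List<Double> weights = new ArrayList<Double>();
  private double distanceCutoff;   // the "facility cost"; assumed already seeded from the data

  WeightedStreamingSketch(double initialCutoff) {
    this.distanceCutoff = initialCutoff;
  }

  void add(double[] point, double pointWeight) {
    if (centers.isEmpty()) {
      centers.add(point.clone());
      weights.add(pointWeight);
      return;
    }
    int nearest = nearestIndex(point);
    double d = euclidean(point, centers.get(nearest));
    // The proposed change: multiply the d / distanceCutoff ratio by the weight
    // of the incoming point, so heavy points (already-collapsed sub-clusters)
    // are more likely to survive as their own centroid instead of being folded in.
    if (rand.nextDouble() < pointWeight * d / distanceCutoff) {
      centers.add(point.clone());
      weights.add(pointWeight);
    } else {
      // Fold the point into the nearest centroid as a weighted average.
      double[] c = centers.get(nearest);
      double w = weights.get(nearest);
      for (int i = 0; i < c.length; i++) {
        c[i] = (c[i] * w + point[i] * pointWeight) / (w + pointWeight);
      }
      weights.set(nearest, w + pointWeight);
    }
  }

  private int nearestIndex(double[] point) {
    int best = 0;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centers.size(); i++) {
      double d = euclidean(point, centers.get(i));
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return best;
  }

  private static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}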
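
And on the knobs, this is how I read the interaction between numClusters,
clusterLogFactor and clusterOvershoot. Again just a sketch of my
understanding, not the real update rule, so take the exact arithmetic with
a grain of salt:

// Sketch of how the knobs interact as I understand them; the exact update in
// the real code may differ. clusterOvershoot only pads the target, delaying the
// moment at which we decide we have "too many" centroids and raise
// distanceCutoff (by beta) and re-collapse. That padding is the insurance
// against ever dropping below enough centroids.
class KnobSketch {
  int numClusters;                    // current target, grows toward roughly k log n
  final double clusterLogFactor;      // the k-ish constant in k log n
  final double clusterOvershoot;      // slack multiplier on the target

  KnobSketch(int initialNumClusters, double clusterLogFactor, double clusterOvershoot) {
    this.numClusters = initialNumClusters;
    this.clusterLogFactor = clusterLogFactor;
    this.clusterOvershoot = clusterOvershoot;
  }

  // Called as points are processed; numProcessed is the count seen so far.
  void updateTarget(long numProcessed) {
    int kLogN = (int) (clusterLogFactor * Math.log(numProcessed));
    numClusters = Math.max(numClusters, kLogN);
  }

  // We only shrink the centroid set (and grow distanceCutoff) once we are past
  // the padded target, never before.
  boolean tooManyCentroids(int currentCentroidCount) {
    return currentCentroidCount > clusterOvershoot * numClusters;
  }
}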
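
Finally, the 1e-9 experiment is easy to see with a data-derived seed for
distanceCutoff: something as crude as half the smallest distance among the
first few points rescales with the data, which a constant 1 / numClusters
never will. Illustration only, not a proposal for the exact seeding rule:

import java.util.ArrayList;
import java.util.List;

// Illustration only (not the Mahout code): seed distanceCutoff from the data so
// that uniformly rescaling the input rescales the cutoff with it.
class CutoffSeedSketch {
  static double initialCutoff(List<double[]> firstFewPoints) {
    double smallest = Double.POSITIVE_INFINITY;
    for (int i = 0; i < firstFewPoints.size(); i++) {
      for (int j = i + 1; j < firstFewPoints.size(); j++) {
        double d = euclidean(firstFewPoints.get(i), firstFewPoints.get(j));
        if (d > 0 && d < smallest) {
          smallest = d;
        }
      }
    }
    return smallest / 2;   // any modest fraction of an observed distance will do as a seed
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    List<double[]> sample = new ArrayList<double[]>();
    sample.add(new double[] {0, 0});
    sample.add(new double[] {1, 0});
    sample.add(new double[] {0, 3});

    // Scale the same data by 1e-9: the seeded cutoff scales with it, whereas a
    // fixed 1 / numClusters cutoff would dwarf every distance and collapse
    // everything into a single cluster.
    List<double[]> scaled = new ArrayList<double[]>();
    for (double[] p : sample) {
      scaled.add(new double[] {p[0] * 1e-9, p[1] * 1e-9});
    }
    System.out.println(initialCutoff(sample));   // about 0.5
    System.out.println(initialCutoff(scaled));   // about 5e-10, i.e. scaled by the same 1e-9
  }
}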