On Wed, May 8, 2013 at 8:09 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Wed, May 8, 2013 at 10:00 AM, Dan Filimon <dangeorge.fili...@gmail.com
> >wrote:
>
> > On Wed, May 8, 2013 at 7:48 PM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
> >
> > > Inline
> > >
> > >
> > > >> He told me two things:
> > > >> - that we should multiply the distance / distanceCutoff ratio by the
> > > >> weight
> > > >> of the point we're trying to cluster so as to avoid collapsing
> larger
> > > >> clusters
> > > >>
> > > >
> > > This makes sense, but I have no idea what the effect will be.
> > >
> >
> > I think it avoids the need for the special handling where we increase
> > distanceCutoff by beta in a separate if.
> >
>
> Sure.  Sounds right and all.
>
> But experiment will tell better.
>
>
> > > >  - the initial cutoff they use is 1 / numClusters basically
> > > >>
> > > >
> > > This is just wrong.
> > >
> > > It is fine in theoretical settings with synthetic clusters of unit
> > > characteristic scale, but is not invariant over uniform scaling so it
> > can't
> > > be correct.
> >
> >
> > But the thing is it's not really a distance at all. I've seen it increase
> > to far beyond what a maximum distance between clusters should be.
> > So, for example, in a case where the maximum distance is 2, it went all
> the
> > way up to 14. :)
> >
> > It's only a distance conceptually but it's more like a "factor". They
> > actually call it a "facility cost" rather than a distance, probably for
> > this reason.
> >
>
> It has units of distance and thus should be invariant with respect to
> uniform rescaling.  No way around that.  That means that it has to be based
> on a measurement from the data.
>
> And since it is a probabilistic bound, it is OK if it exceeds other
> distances.  14 versus 2 does seem a bit high, but conceptually exceeding
> the distance is not a problem.  I think that the 14 versus 2 problem is
> something else.
>

Okay, yes, it should be a distance. You're right that, especially given that
it's used in a ratio, the result should be unit-less.
I'm uncomfortable with the distanceCutoff growing too high, but I'll just
put the blame for that one on the data.
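
For concreteness, the weight-scaled assignment rule we've been discussing might look like this. This is a sketch, not Mahout's actual StreamingKMeans code; the names (`distance_cutoff`, `assign_point`) and the injectable `rng` are illustrative:

```python
import math
import random

def assign_point(point, weight, centroids, weights, distance_cutoff,
                 rng=random.random):
    """One streaming k-means step: with probability proportional to
    weight * distance / distance_cutoff, open a new cluster; otherwise
    merge the point into its nearest centroid."""
    if not centroids:
        centroids.append(list(point))
        weights.append(weight)
        return
    # nearest centroid by Euclidean distance
    best, d = min(((i, math.dist(point, c)) for i, c in enumerate(centroids)),
                  key=lambda t: t[1])
    # weight * d / distance_cutoff is dimensionless: d and the cutoff both
    # carry units of distance, so uniform rescaling of the data cancels out
    p = min(1.0, weight * d / distance_cutoff)
    if rng() < p:
        centroids.append(list(point))  # open a new cluster
        weights.append(weight)
    else:
        # merge: weighted average of the centroid and the incoming point
        w = weights[best]
        centroids[best] = [(w * a + weight * b) / (w + weight)
                           for a, b in zip(centroids[best], point)]
        weights[best] = w + weight
```

The weight factor is the proposed change: a point that already represents a heavy cluster is proportionally more likely to stay separate rather than be absorbed.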

StreamingKMeans + BallKMeans gave good results compared to Mahout KMeans on
other data sets (similar kinds of clusters and good-looking Dunn and
Davies-Bouldin indices).
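
For anyone following along, the two quality indices are straightforward to compute from a finished clustering. A minimal sketch in plain Python (Euclidean distances; `clusters` is a list of lists of points):

```python
import math

def dunn_index(clusters):
    """Smallest inter-cluster distance / largest intra-cluster diameter.
    Higher is better."""
    inter = min(math.dist(p, q)
                for i, ci in enumerate(clusters)
                for cj in clusters[i + 1:]
                for p in ci for q in cj)
    intra = max(math.dist(p, q)
                for c in clusters for p in c for q in c)
    return inter / intra

def davies_bouldin(clusters):
    """Mean over clusters of the worst (s_i + s_j) / d(c_i, c_j) ratio,
    where s_i is the mean distance of members to their centroid.
    Lower is better."""
    centroids = [[sum(x) / len(c) for x in zip(*c)] for c in clusters]
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, centroids)]
    k = len(clusters)
    return sum(
        max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i)
        for i in range(k)) / k
```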


> > > >> Additionally, clusterOvershoot, the thing we're using to delay
> > > >> distance cutoff increases, also seems somewhat unnecessary. Why use
> > > >> it and get a lot more centroids than we asked for?
> > > >>
> > > >
> > > Well, it needs to be there until we get more run-time.  Getting more
> > > centroids doesn't hurt and getting too few definitely does hurt.  Thus
> we
> > > need some insurance.
> >
> >
> > Well, it would get as many clusters as it needs to. And we increase
> > numClusters as we go until it's clusterLogFactor * numProcessedDatapoints.
> > I understand the point that we should get more rather than fewer
> > centroids, but k log n (what we're getting at the end) seems fine.
> >
>
> The key is not the number at the end.  It is that the number never drops
> below enough centroids.


Hmm, I guess you're referring to some actual cases of this that you've
observed. On the data I tried, it seemed like collapsing did fairly little
to reduce the number of clusters, especially without increasing the
distance cutoff.

So this would happen when the distance cutoff grows too big. Still, I
think this is probably taken care of by factoring the weight of the
point into the ratio. That factor is there so that two big clusters
don't merge when they shouldn't.


>
> > The estimate we give it at the beginning is only valid as long as not
> > enough datapoints have been processed to go over k log n.
> >
>
> Are we talking about clusterOvershoot here?  Or the numClusters over-ride?


We collapse the clusters when the number of actual centroids exceeds
clusterOvershoot * numClusters.
I'm thinking that since numClusters increases anyway, clusterOvershoot
means we end up with more clusters than we need (not bad per se, but I'm
trying to get rid of variables).
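
Roughly, the bookkeeping under debate looks like this. A sketch with illustrative names (`beta`, `cluster_overshoot`, `cluster_log_factor`, and the log-based growth rule stand in for the actual Mahout parameters and may differ in detail):

```python
import math

def maybe_collapse(centroids, num_processed, num_clusters,
                   cluster_overshoot, cluster_log_factor,
                   distance_cutoff, beta, recluster):
    """Collapse the sketch when it grows past the overshoot bound.
    The target numClusters itself grows as more points stream in."""
    # let the target grow with the number of points seen so far
    # (one plausible growth rule; the exact formula is an assumption here)
    num_clusters = max(num_clusters,
                       int(cluster_log_factor * math.log(max(num_processed, 2))))
    if len(centroids) > cluster_overshoot * num_clusters:
        # recluster the centroids down to the target, and loosen the
        # cutoff by beta so fewer new clusters open afterwards
        centroids = recluster(centroids, num_clusters)
        distance_cutoff *= beta
    return centroids, num_clusters, distance_cutoff
```

The point in question: because `num_clusters` keeps rising on its own, the extra `cluster_overshoot > 1` slack may be redundant; Ted's counterpoint is that it insures against the centroid count ever dropping below what's needed mid-stream.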


> > > I think we already do better than some of the guarantees in the paper.
> > >
> >
> > I think we do the same, and I'm wondering whether the extra knobs do
> > anything or not. I tried fiddling with them a bit and they didn't seem to
> > do much.
> > I also haven't worked through the math for our particular version.
> >
>
> Well, we have seen cases where the overshoot needed to be >1.  Those may
> have gone away with better adaptation, but I think they probably still
> can happen.
>

Sorry, what do you mean by adaptation here?


> The distanceCutoff initialization is blatantly important if you try scaling
> the test data by 1e-9.  Without adapting to the data, you will get exactly
> one cluster.
>

Okay, the nail is firmly in the coffin of that idea. :)
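
Ted's scaling argument is easy to demonstrate: a fixed cutoff like 1/numClusters doesn't rescale with the data, so shrinking the data by 1e-9 pushes every distance far below the cutoff and nothing ever opens a new cluster, while a cutoff estimated from the data shrinks along with it. A toy illustration (deterministic stand-in for the probabilistic rule; the 0.25-of-spread estimate is purely illustrative):

```python
def count_clusters(points, cutoff):
    """Greedy sketch: a 1-D point opens a new cluster whenever its
    distance to every existing centroid exceeds the cutoff."""
    centroids = []
    for p in points:
        if all(abs(p - c) > cutoff for c in centroids):
            centroids.append(p)
    return len(centroids)

def data_cutoff(ps):
    # hypothetical data-derived cutoff: a fraction of the data's spread,
    # so it carries the same units (and scale) as the distances
    return 0.25 * (max(ps) - min(ps))

points = [0.0, 1.0, 2.0, 3.0]        # unit-scale data
tiny = [p * 1e-9 for p in points]    # same data, scaled by 1e-9
fixed_cutoff = 1 / 4                 # "1 / numClusters": not scale-aware
```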
