On Wed, May 8, 2013 at 10:28 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:

> > > I think it avoids the need for the special way we handle the increase of
> > > distanceCutoff by beta in another if.
> > >
> >
> > Sure.  Sounds right and all.
> >
> > But experiment will tell better.
>

yes.

But I definitely saw cases where the same cutoff caused the centroid count
to decrease.  In my mind, continuing to increase the cutoff in those cases
is a bad thing.  A smaller cutoff is more conservative in that it will
preserve more data in the sketch.  Until we see it preserving too much
data, we don't need to increase the cutoff.
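
Roughly the policy I have in mind, as a sketch only (this is not the actual
StreamingKMeans code; collapseClusters(), beta, and the field names are just
stand-ins for whatever we end up with):

  // Sketch only: collapse first at the current cutoff; grow the cutoff by
  // beta only when that collapse fails to shrink the sketch.
  void maybeCollapse(List<Centroid> centroids) {
    if (centroids.size() > clusterOvershoot * numClusters) {
      int before = centroids.size();
      List<Centroid> collapsed = collapseClusters(centroids, distanceCutoff);
      centroids.clear();
      centroids.addAll(collapsed);
      // A smaller cutoff is the conservative choice since it preserves more
      // data in the sketch.  Escalate only once it has proved ineffective.
      if (centroids.size() >= before) {
        distanceCutoff *= beta;
      }
    }
  }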


> > > ... They
> > > actually call it a "facility cost" rather than a distance, probably for
> > > this reason.
>

Btw... they call it a facility cost because they are referring to a
different literature.  With k-means, k is traditionally fixed.  With
facility assignment, it is traditionally not.  The problems are otherwise
quite similar.  The difference in nomenclature arises because the facility
assignment work comes from operations research, not computer science.

> ... I'm uncomfortable with the distanceCutoff growing too high, but I'll
> just put the blame for that one on the data.
>

I am uncomfortable as well.

This is one reason I would like to increase the distanceCutoff only when a
small value proves ineffective.


> StreamingKMeans + BallKMeans gave good results compared to Mahout KMeans on
> other data sets (similar kinds of clusters and good looking Dunn and
> Davies-Bouldin indices).
>

You hide this gem in a long email!!!

Good news.


> >
>
> > > The estimate we give it at the beginning is only valid as long as not
> > > enough datapoints have been processed to go over k log n.
> > >
> >
> > Are we talking about clusterOvershoot here?  Or the numClusters
> > over-ride?
>
>
> We collapse the clusters when the number of actual centroids is over
> clusterOvershoot * numClusters.
> I'm thinking that since numClusters increases anyway, clusterOvershoot
> means we end up with more clusters than we need (not bad per se, but trying
> to get rid of variables).
>

I view numClusters as the minimum number of clusters that we want to see.
The clusterOvershoot factor says that we can go a ways above that minimum,
but hopefully we will just collapse back down to the minimum or a bit above
it.
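
Put differently (illustrative names and numbers only, not the shipping
code): numClusters is a moving floor that grows roughly like k log n with
the points seen so far, and clusterOvershoot is the slack above that floor
before we force a collapse.

  // Illustrative sketch of the roles of numClusters and clusterOvershoot.
  long n = pointsSeenSoFar;                          // points processed so far
  int floor = (int) Math.max(k, k * Math.log(n));    // minimum clusters we want
  int trigger = (int) (clusterOvershoot * floor);    // collapse above this
  // e.g. k = 20, n = 1,000,000 gives a floor of about 276; with an overshoot
  // of 2 we let the sketch grow to roughly 552 centroids before collapsing,
  // and the collapse should land near the floor, not below it.
  boolean shouldCollapse = numCentroids > trigger;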



> > Well, we have seen cases where the over-shoot needed to be >1.  Those may
> > have gone away with better adaptation, but I think that they probably
> > still can happen.
> >
>
> Sorry, what do you mean by adaptation here?
>

Better adjustment and use of the distanceCutoff.  This should make the
collapse in the recursive clustering less dramatic and more predictable.
That will make the system require less over-shoot.
