Hi Dan,

Sure. I took a quick look just now and it looks good. Did you notice that
shuffling before collapsing was helping, and is that why you kept it in? It
didn't make much difference for me.

Andy



On 9 May 2013 16:05, Dan Filimon <[email protected]> wrote:

> Andy, would you like to review the final version of the clustering code
> before it goes in [1]?
> [1] https://reviews.apache.org/r/10194/
>
> Ted, it's pretty much done. Okay it and I'll commit.
>
>
> On Wed, May 8, 2013 at 11:57 PM, Ted Dunning <[email protected]> wrote:
>
>> On Wed, May 8, 2013 at 10:28 AM, Dan Filimon <[email protected]> wrote:
>>
>> > > > I think it avoids the need for the special case where we increase
>> > > > distanceCutoff by beta in another if.
>> > > >
>> > >
>> > > Sure.  Sounds right and all.
>> > >
>> > > But experiment will tell better.
>> >
>>
>> yes.
>>
>> But I definitely saw cases where the same cutoff caused the centroid count
>> to decrease.  In my mind, continuing to increase the cutoff in those cases
>> is a bad thing.  A smaller cutoff is more conservative in that it will
>> preserve more data in the sketch.  Until we see it preserving too much
>> data, we don't need to increase the cutoff.
>>
>
> I kept the overshoot in the CL just to be safe.
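
For what it's worth, the policy Ted describes above (only growing the cutoff
when a collapse with the current cutoff fails to shrink the sketch) would look
roughly like this. It's only a sketch to check my understanding; beta,
collapse() and the field names here are placeholders, not the code in the
review.

import java.util.List;

class CutoffPolicySketch {
  private double distanceCutoff = 1.0e-6; // start small and conservative
  private final double beta = 1.3;        // multiplicative growth factor
  private final int numClusters = 100;    // current target sketch size
  private final double clusterOvershoot = 1.0;

  // Collapse the sketch when it gets too big, but only grow the cutoff if the
  // collapse with the current cutoff proved ineffective.
  void maybeCollapse(List<double[]> centroids) {
    if (centroids.size() <= clusterOvershoot * numClusters) {
      return;                              // sketch is still small enough
    }
    int before = centroids.size();
    collapse(centroids);                   // re-cluster the centroids in place
    if (centroids.size() >= before) {
      distanceCutoff *= beta;              // this cutoff failed; be more aggressive
    }
    // otherwise keep the smaller cutoff and preserve more detail in the sketch
  }

  private void collapse(List<double[]> centroids) {
    // placeholder: one streaming pass over the centroids with the current cutoff
  }
}
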
>
>> > > > ... They actually call it a "facility cost" rather than a distance,
>> > > > probably for this reason.
>> >
>>
>> Btw... the reason they call it a facility cost is that they are referring
>> to a different literature.  With k-means, k is traditionally fixed.  With
>> facility assignment, it is traditionally not.  The problems are otherwise
>> quite similar.  The difference in nomenclature comes about because the
>> facility assignment work comes from operations research, not computer
>> science.
>>
>
> Ah, well that explains it. :)
>
> ... I'm uncomfortable with the distanceCutoff growing too high, but I'll
>> > just blame that one on the data.
>> >
>>
>> I am uncomfortable as well.
>>
>> This is one reason I would like to only increase the distanceCutoff when a
>> small value proves ineffective.
>
>
> Alright, this is the version that's going in.
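
Just to spell out why a runaway distanceCutoff worries me: in the usual
streaming sketch the cutoff gates how often a point is promoted to a new
centroid, so once it grows too high almost everything gets merged and the
sketch goes coarse. Roughly like this (made-up class and method names, not
the patch itself):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class SketchStep {
  private final List<double[]> centroids = new ArrayList<>();
  private final List<Double> weights = new ArrayList<>();
  private final Random random = new Random();
  private double distanceCutoff = 1.0e-6;

  void processPoint(double[] point) {
    if (centroids.isEmpty()) {
      centroids.add(point.clone());
      weights.add(1.0);
      return;
    }
    int nearest = nearestIndex(point);
    double d = squaredDistance(point, centroids.get(nearest));
    // A bigger cutoff means a smaller d / distanceCutoff, so fewer new
    // centroids: this is why a cutoff that grows too high makes the sketch
    // too coarse.
    if (random.nextDouble() < d / distanceCutoff) {
      centroids.add(point.clone());
      weights.add(1.0);
    } else {
      mergeInto(nearest, point);
    }
  }

  private int nearestIndex(double[] point) {
    int best = 0;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centroids.size(); i++) {
      double d = squaredDistance(point, centroids.get(i));
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return best;
  }

  private void mergeInto(int index, double[] point) {
    double[] c = centroids.get(index);
    double w = weights.get(index);
    for (int i = 0; i < c.length; i++) {
      c[i] = (w * c[i] + point[i]) / (w + 1.0);   // weighted mean update
    }
    weights.set(index, w + 1.0);
  }

  private static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
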
>
>
>> > StreamingKMeans + BallKMeans gave good results compared to Mahout KMeans
>> > on other data sets (similar kinds of clusters and good-looking Dunn and
>> > Davies-Bouldin indices).
>> >
>>
>> You hide this gem in a long email!!!
>>
>> Good news.
>
>
> Yeah. :)
> It's comparable to Mahout KMeans quality-wise, and very tweakable.
> The speed improvements should be apparent on large data sets that we run
> on Hadoop.
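
Nice. If anyone wants to sanity-check the quality numbers on their own data,
the Davies-Bouldin index is cheap to compute once you have the final
assignment (lower is better). Here's a small stand-alone version; it's not
the evaluation code Dan used, just the textbook formula:

import java.util.List;

class DaviesBouldin {
  /**
   * Davies-Bouldin index: the average, over clusters, of the worst ratio
   * (scatter_i + scatter_j) / distance(centroid_i, centroid_j).
   *
   * @param centroids one centroid per cluster
   * @param clusters  the points assigned to each cluster, same order as centroids
   */
  static double index(List<double[]> centroids, List<List<double[]>> clusters) {
    int k = centroids.size();
    double[] scatter = new double[k];
    for (int i = 0; i < k; i++) {
      double sum = 0.0;
      for (double[] p : clusters.get(i)) {
        sum += distance(p, centroids.get(i));
      }
      scatter[i] = clusters.get(i).isEmpty() ? 0.0 : sum / clusters.get(i).size();
    }
    double total = 0.0;
    for (int i = 0; i < k; i++) {
      double worst = 0.0;
      for (int j = 0; j < k; j++) {
        if (i == j) {
          continue;
        }
        double ratio =
            (scatter[i] + scatter[j]) / distance(centroids.get(i), centroids.get(j));
        worst = Math.max(worst, ratio);
      }
      total += worst;
    }
    return total / k;
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}
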
>
> > >
>> >
>> > > > The estimate we give it at the beginning is only valid as long as not
>> > > > enough data points have been processed to go over k log n.
>> > > >
>> > >
>> > > Are we talking about clusterOvershoot here?  Or the numClusters
>> > > override?
>> >
>> >
>> > We collapse the clusters when the number of actual centroids is over
>> > clusterOvershoot * numClusters.
>> > I'm thinking that since numClusters increases anyway, clusterOvershoot
>> > means we end up with more clusters than we need (not bad per se, but
>> > trying to get rid of variables).
>> >
>>
>> I view numClusters as the minimum number of clusters that we want to
>> see.  clusterOvershoot says that we can go a ways above the minimum, but we
>> hopefully will just collapse back down to the minimum or above.
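
That matches my reading: numClusters itself should track roughly k log n of
the points seen so far, and clusterOvershoot only controls how far past that
target we let the sketch drift before collapsing. Something like the
following, with all names illustrative rather than taken from the patch:

class SketchSize {
  private final int k;                    // the number of clusters we ultimately want
  private final double clusterOvershoot;  // slack above the target before collapsing
  private long pointsSeen;
  private int numClusters;

  SketchSize(int k, double clusterOvershoot, long estimatedNumPoints) {
    this.k = k;
    this.clusterOvershoot = clusterOvershoot;
    this.pointsSeen = 0;
    // initial estimate, only valid until we have processed more points than expected
    this.numClusters = targetFor(estimatedNumPoints);
  }

  // Target sketch size of roughly k log n for the n points seen so far.
  private int targetFor(long n) {
    return (int) Math.max(k, k * Math.log(Math.max(2, n)));
  }

  // Call once per point processed; returns true when the sketch has drifted
  // far enough above the current target to warrant a collapse.
  boolean shouldCollapse(int currentCentroidCount) {
    pointsSeen++;
    numClusters = Math.max(numClusters, targetFor(pointsSeen));
    return currentCentroidCount > clusterOvershoot * numClusters;
  }
}
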
>>
>>
>>
>> > > Well, we have seen cases where the over-shoot needed to be >1.  Those
>> > > may have gone away with better adaptation, but I think that they
>> > > probably still can happen.
>> > >
>> >
>> > Sorry, what do you mean by adaptation here?
>> >
>>
>> Better adjustment and use of the distanceCutoff.  This should make the
>> collapse in the recursive clustering less dramatic and more predictable.
>> That will make the system require less overshoot.
>>
>
>


-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
