I haven't noticed it helping, but it makes me feel somewhat (irrationally :)
better knowing that the points don't come through in the same order they
originally came in.
I thought of maybe having a flag, but I'm kind of split on the issue.

Even if they aren't shuffled, we need to copy them to another list before
collapsing anyway, so we'd still be looping through them once.
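To make the point concrete, here's a rough sketch of what I mean: since the
centroids get copied into a fresh list before collapsing regardless, the
shuffle adds no extra pass over the data. (Names here are hypothetical, just
for illustration, not the actual Mahout code.)

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CopyAndShuffle {
  // The copy is the single pass we'd pay for anyway; Collections.shuffle
  // then does an in-place Fisher-Yates on the copy.
  static <T> List<T> copyAndShuffle(Iterable<T> centroids, Random random) {
    List<T> copy = new ArrayList<>();
    for (T c : centroids) {
      copy.add(c); // the one pass we do regardless of shuffling
    }
    Collections.shuffle(copy, random); // randomize order before collapsing
    return copy;
  }

  public static void main(String[] args) {
    List<Integer> shuffled = copyAndShuffle(List.of(1, 2, 3, 4, 5), new Random(42));
    System.out.println(shuffled); // same elements, randomized order
  }
}
```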


On Thu, May 9, 2013 at 10:09 PM, Andy Twigg <[email protected]> wrote:

> Hi Dan,
>
> Sure. I took a quick look just now and it looks good. Did you notice that
> shuffling before collapsing was helping, hence keeping it in? It didn't
> make much difference for me.
>
> Andy
>
>
>
> On 9 May 2013 16:05, Dan Filimon <[email protected]> wrote:
>
>> Andy, would you like to review the final version of the clustering code
>> before it goes in [1]?
>> [1] https://reviews.apache.org/r/10194/
>>
>> Ted, it's pretty much done. Okay it and I'll commit.
>>
>>
>> On Wed, May 8, 2013 at 11:57 PM, Ted Dunning <[email protected]> wrote:
>>
>>> On Wed, May 8, 2013 at 10:28 AM, Dan Filimon <[email protected]> wrote:
>>>
>>> > > > I think it avoids the need of the special way we handle the
>>> increase of
>>> > > > distanceCutoff by beta in another if.
>>> > > >
>>> > >
>>> > > Sure.  Sounds right and all.
>>> > >
>>> > > But experiment will tell better.
>>> >
>>>
>>> yes.
>>>
>>> But I definitely saw cases where the same cutoff caused the centroid
>>> count
>>> to decrease.  In my mind, continuing to increase the cutoff in those
>>> cases
>>> is a bad thing.  A smaller cutoff is more conservative in that it will
>>> preserve more data in the sketch.  Until we see it preserving too much
>>> data, we don't need to increase the cutoff.
>>>
>>
>> I kept the overshoot just to be safe in the CL.
>>
>>> > > > ... They
>>> > > > actually call it a "facility cost" rather than a distance,
>>> probably for
>>> > > > this reason.
>>> >
>>>
>>> Btw... the reason that they call it a facility cost is because they are
>>> referring to a different literature.  With k-means, k is traditionally
>>> fixed.  With facility assignment, it is traditionally not.  The problems
>>> are otherwise quite similar.  The reason for the difference in
>>> nomenclature
>>> is because the facility assignment stuff comes from operations research,
>>> not computer science.
>>>
>>
>> Ah, well that explains it. :)
>>
>> ... I'm uncomfortable with the distanceCutoff growing too high, but I'll
>>> > just
>>> > put the blame on that one on the data.
>>> >
>>>
>>> I am uncomfortable as well.
>>>
>>> This is one reason I would like to only increase the distanceCutoff when
>>> a
>>> small value proves ineffective.
>>
>>
>> Alright, this is the version that's going in.
>>
>>
>>>  > StreamingKMeans + BallKMeans gave good results compared to Mahout
>>> KMeans on
>>> > other data sets (similar kinds of clusters and good looking Dunn and
>>> > Davies-Bouldin indices).
>>> >
>>>
>>> You hide this gem in a long email!!!
>>>
>>> Good news.
>>
>>
>> Yeah. :)
>> It's comparable to Mahout KMeans quality-wise, and very tweakable.
>> The speed improvements should be apparent on large data sets that we run
>> on Hadoop.
>>
>> > >
>>> >
>>> > > The estimate we give it at the beginning is only valid as long as not
>>> > > > enough datapoints have been processed to go over k log n.
>>> > > >
>>> > >
>>> > > Are we talking about clusterOvershoot here?  Or the numClusters
>>> > over-ride?
>>> >
>>> >
>>> > We collapse the clusters when the number of actual centroids is over
>>> > clusterOvershoot * numClusters.
>>> > I'm thinking that since numClusters increases anyway, clusterOvershoot
>>> > means we end up with more clusters than we need (not bad per se, but
>>> trying
>>> > to get rid of variables).
>>> >
>>>
>>> I view it as numClusters is the minimum number of clusters that we want
>>> to
>>> see.  ClusterOverShoot says that we can go a ways above the minimum, but
>>> we
>>> hopefully will just collapse back down to the minimum or above.
>>>
>>>
>>>
>>> > > Well, we have seen cases where the over-shoot needed to be >1.
>>>  Those may
>>> > > have gone away with better adaptation, but I think that they probably
>>> > still
>>> > > can happen.
>>> > >
>>> >
>>> > Sorry, what do you mean by adaptation here?
>>> >
>>>
>>> Better adjustment and use of the distanceCutoff.  This should make the
>>> collapse in the recursive clustering be less dramatic and more
>>> predictable.
>>>  That will make the system require less over-shoot.
>>>
>>
>>
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> [email protected] | +447799647538
>
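P.S. For anyone skimming the thread later, the collapse/cutoff interplay
discussed above boils down to: collapse when the centroid count exceeds
clusterOvershoot * numClusters, and only grow distanceCutoff (by beta) when
a collapse at the current cutoff proves ineffective. A very rough sketch,
where all the names and the collapse step itself are simplified stand-ins,
not the real StreamingKMeans implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class CollapseSketch {
  double distanceCutoff = 1.0;
  final double beta = 1.3;             // growth factor for the cutoff
  final double clusterOvershoot = 2.0; // how far above numClusters we tolerate
  int numClusters = 10;                // grows roughly like k log n

  // Placeholder collapse: pretend each pass at a given cutoff merges the
  // centroids down to half as many. Real code would re-cluster using `cutoff`.
  int collapse(List<double[]> centroids, double cutoff) {
    return centroids.size() / 2; // stand-in result
  }

  int maybeCollapse(List<double[]> centroids) {
    int count = centroids.size();
    while (count > clusterOvershoot * numClusters) {
      int after = collapse(centroids, distanceCutoff);
      if (after >= count) {
        // Only grow the cutoff once the current one proved ineffective.
        distanceCutoff *= beta;
      }
      count = after;
      centroids = centroids.subList(0, count);
    }
    return count;
  }

  public static void main(String[] args) {
    CollapseSketch sketch = new CollapseSketch();
    List<double[]> centroids = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      centroids.add(new double[] {i});
    }
    System.out.println(sketch.maybeCollapse(centroids)); // ends below the overshoot bound
  }
}
```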
