These sorts of optimizations could delay the growth of canopy clusters
in situations where the clustering thresholds are set too low for the
dataset. At some point, though, the mapper would still run out of memory
if enough points become canopies. That decision rests with the T2
threshold, which determines whether a point is "closely bound" to a
pre-existing canopy. If T2 is set too small for the dataset then every
point will become a canopy, and these optimizations will only delay the
inevitable.
I have read that there are estimation techniques one can run on a new
dataset to get guidance on initial threshold values. I'll keep looking
to see if I can find a reference.
On 5/2/10 2:06 PM, Ted Dunning wrote:
How about making the threshold adapt over time?
Another option is to keep a count of all of the canopies so far and evict
any that have too few points and too large an average distance. The points
emitted so far would still reference these canopies, but we wouldn't be able
to add new points to them.
The number of canopies should grow with the amount of data, but slowly; log
N or slower is probably about right. Clever adjustment of t1 could enforce
this, and eviction of early canopies that were accepted with a small
threshold could avoid problems with the transients of the adaptation.
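The eviction idea can be sketched roughly as follows. This is an illustrative toy, not Mahout's actual API: the Canopy class, its count/distance bookkeeping, and the minPoints/maxAvgDistance parameters are all assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class EvictionSketch {
    // Hypothetical per-canopy bookkeeping: how many points this canopy has
    // absorbed and the running sum of their distances from its center.
    static class Canopy {
        final int count;
        final double distanceSum;
        Canopy(int count, double distanceSum) {
            this.count = count;
            this.distanceSum = distanceSum;
        }
        double avgDistance() { return count == 0 ? 0.0 : distanceSum / count; }
    }

    // Evict canopies with too few points AND too large an average distance.
    // Points already emitted still reference evicted canopies; we simply
    // stop adding new points to them.
    static List<Canopy> evict(List<Canopy> canopies, int minPoints,
                              double maxAvgDistance) {
        List<Canopy> kept = new ArrayList<>();
        for (Canopy c : canopies) {
            boolean weak = c.count < minPoints
                    && c.avgDistance() > maxAvgDistance;
            if (!weak) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Canopy> all = new ArrayList<>();
        all.add(new Canopy(100, 50.0)); // many points, avg 0.5 -> kept
        all.add(new Canopy(2, 8.0));    // few points, avg 4.0  -> evicted
        all.add(new Canopy(3, 0.3));    // few points but tight -> kept
        System.out.println(evict(all, 10, 1.0).size()); // prints 2
    }
}
```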
On Sun, May 2, 2010 at 5:19 AM, Robin Anil<robin.a...@gmail.com> wrote:
The algorithm is simple. For each point read into the mapper:
If its distance to the closest canopy (from the in-memory List<>) is within
the threshold t1, add the point to that canopy.
Else, if the distance is greater than the threshold t1, create a new
canopy (into the in-memory List<>).
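Robin's loop can be sketched roughly as below, with the T2 "closely bound" test from earlier in the thread included. This is an illustrative one-dimensional toy, not Mahout's actual CanopyMapper: the class names, the absolute-difference distance, and the sample T1/T2 values are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {
    // Illustrative canopy: a center plus the points loosely bound to it.
    static class Canopy {
        final double center;
        final List<Double> points = new ArrayList<>();
        Canopy(double center) {
            this.center = center;
            points.add(center);
        }
    }

    static final double T1 = 3.0; // loose bound: point joins the canopy
    static final double T2 = 1.0; // tight bound: point won't seed a new canopy

    final List<Canopy> canopies = new ArrayList<>();

    // Process one point as it is read into the mapper.
    void addPoint(double p) {
        boolean closelyBound = false;
        for (Canopy c : canopies) {
            double d = Math.abs(p - c.center); // 1-D distance for illustration
            if (d < T1) c.points.add(p);       // within T1: join this canopy
            if (d < T2) closelyBound = true;   // within T2: "closely bound"
        }
        if (!closelyBound) {
            canopies.add(new Canopy(p));       // no tight match: new canopy
        }
    }

    public static void main(String[] args) {
        CanopySketch cs = new CanopySketch();
        for (double p : new double[] {0.0, 0.5, 5.0, 5.2}) {
            cs.addPoint(p);
        }
        System.out.println(cs.canopies.size()); // prints 2
    }
}
```

Two well-separated groups yield two canopies here; shrink T2 toward zero and every point seeds its own canopy, which is exactly the memory concern raised above.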