These sorts of optimizations could delay the growth of canopy clusters
in situations where the clustering thresholds are set too low for the
dataset. At some point, though, the mapper would still run out of memory
if enough points become canopies. That decision rests with the T2
threshold, which determines whether a point is "closely bound" to a
pre-existing canopy. If T2 is set too small for the dataset then every
point will become a canopy, and these optimizations will only delay the
inevitable.
I have read that there are estimation techniques one can run on a new
dataset to get guidance on initial threshold values. I'll keep looking
to see if I can find a reference.
On 5/2/10 2:06 PM, Ted Dunning wrote:
How about making the threshold adapt over time?
Another option is to keep a count of all of the canopies so far and evict
any that have too few points and too large an average distance. The points
emitted so far would still reference these canopies, but we wouldn't be able
to add new points to them.
The number of canopies should grow with the amount of data, but slowly; log
N or slower is probably about right. Clever adjustment of t1 could enforce
this, and eviction of early canopies that were accepted with a small
threshold could avoid problems with the transients of the adaptation.
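The eviction idea can be sketched roughly as follows. This is an illustrative toy, not Mahout's actual API: the Canopy class, its count/distance bookkeeping, and the minPoints/maxAvgDistance parameters are all assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class EvictionSketch {
    // Hypothetical per-canopy bookkeeping: how many points this canopy has
    // absorbed and the running sum of their distances from its center.
    static class Canopy {
        final int count;
        final double distanceSum;
        Canopy(int count, double distanceSum) {
            this.count = count;
            this.distanceSum = distanceSum;
        }
        double avgDistance() { return count == 0 ? 0.0 : distanceSum / count; }
    }

    // Evict canopies with too few points AND too large an average distance.
    // Points already emitted still reference evicted canopies; we simply
    // stop adding new points to them.
    static List<Canopy> evict(List<Canopy> canopies, int minPoints,
                              double maxAvgDistance) {
        List<Canopy> kept = new ArrayList<>();
        for (Canopy c : canopies) {
            boolean weak = c.count < minPoints
                    && c.avgDistance() > maxAvgDistance;
            if (!weak) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Canopy> all = new ArrayList<>();
        all.add(new Canopy(100, 50.0)); // many points, avg 0.5 -> kept
        all.add(new Canopy(2, 8.0));    // few points, avg 4.0  -> evicted
        all.add(new Canopy(3, 0.3));    // few points but tight -> kept
        System.out.println(evict(all, 10, 1.0).size()); // prints 2
    }
}
```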
On Sun, May 2, 2010 at 5:19 AM, Robin Anil<robin.a...@gmail.com> wrote:
The algorithm is simple. For each point read into the mapper:
If its distance to the closest canopy (from the in-memory List<>) is within
the threshold t1, add the point to that canopy.
Else, if the distance is greater than the threshold t1, create a new
canopy (into the in-memory List<>).
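Robin's loop can be sketched roughly as below, with the T2 "closely bound" test from earlier in the thread included. This is an illustrative one-dimensional toy, not Mahout's actual CanopyMapper: the class names, the absolute-difference distance, and the sample T1/T2 values are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {
    // Illustrative canopy: a center plus the points loosely bound to it.
    static class Canopy {
        final double center;
        final List<Double> points = new ArrayList<>();
        Canopy(double center) {
            this.center = center;
            points.add(center);
        }
    }

    static final double T1 = 3.0; // loose bound: point joins the canopy
    static final double T2 = 1.0; // tight bound: point won't seed a new canopy

    final List<Canopy> canopies = new ArrayList<>();

    // Process one point as it is read into the mapper.
    void addPoint(double p) {
        boolean closelyBound = false;
        for (Canopy c : canopies) {
            double d = Math.abs(p - c.center); // 1-D distance for illustration
            if (d < T1) c.points.add(p);       // within T1: join this canopy
            if (d < T2) closelyBound = true;   // within T2: "closely bound"
        }
        if (!closelyBound) {
            canopies.add(new Canopy(p));       // no tight match: new canopy
        }
    }

    public static void main(String[] args) {
        CanopySketch cs = new CanopySketch();
        for (double p : new double[] {0.0, 0.5, 5.0, 5.2}) {
            cs.addPoint(p);
        }
        System.out.println(cs.canopies.size()); // prints 2
    }
}
```

Two well-separated groups yield two canopies here; shrink T2 toward zero and every point seeds its own canopy, which is exactly the memory concern raised above.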