Canopy clustering

Aaron Kaplan Mon, 18 Jul 2011 16:21:37 -0700

The Mahout wiki page for canopy clustering [1] links to the KDD 2000 paper by McCallum et al. [2], but I read the paper and I read the Mahout code, and they don't seem to be doing the same thing at all. In the paper, multiple seed prototypes are placed in each canopy, and each point is compared only to the prototypes within its own canopies. The fact that they avoid comparing every point with every prototype is the whole point of the method. The two uses of canopy that I've looked at in the Mahout code are in org.apache.mahout.clustering.syntheticcontrol.canopy.Job and org.apache.mahout.clustering.syntheticcontrol.kmeans.Job, and as I understand it they both just use the canopy centroids as seed prototypes for a classical clustering approach, where every point gets compared with every prototype in each iteration. Is that accurate, or have I misread the code?
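
To make the distinction concrete, here is a minimal Python sketch (not Mahout code, and not taken from the paper's implementation) that counts point-to-prototype comparisons in one assignment pass, once restricted to shared canopies and once for the classical all-pairs approach. The function names (build_canopies, canopies_of, count_comparisons), the thresholds, and the toy data are all made up for illustration. As a simplification, the same Euclidean distance is used for building canopies and for everything else, whereas the paper uses a cheap approximate metric for the canopy step.

```python
import math

def build_canopies(points, t1, t2):
    """Greedy canopy construction with loose threshold t1 > tight threshold t2.

    Returns a list of canopy centers (as coordinates). A point may fall
    within t1 of several centers, so canopies can overlap.
    """
    remaining = list(range(len(points)))
    centers = []
    while remaining:
        center = points[remaining[0]]
        centers.append(center)
        # Points within the tight threshold can no longer start a new canopy.
        remaining = [i for i in remaining if math.dist(center, points[i]) > t2]
    return centers

def canopies_of(x, centers, t1):
    """Indices of all canopies whose center lies within t1 of x."""
    return {c for c, center in enumerate(centers) if math.dist(center, x) <= t1}

def count_comparisons(points, prototypes, centers, t1):
    """Return (restricted, full) comparison counts for one assignment pass."""
    proto_canopies = [canopies_of(p, centers, t1) for p in prototypes]
    restricted = 0
    for x in points:
        mine = canopies_of(x, centers, t1)
        # Only prototypes sharing at least one canopy with x are compared.
        restricted += sum(1 for pc in proto_canopies if pc & mine)
    full = len(points) * len(prototypes)  # classical: every point vs every prototype
    return restricted, full

# Toy data: two well-separated groups and one prototype near each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
prototypes = [(0.3, 0.3), (10.3, 10.3)]
T1, T2 = 3.0, 1.5

centers = build_canopies(points, T1, T2)
restricted, full = count_comparisons(points, prototypes, centers, T1)
print(len(centers), restricted, full)
```

On this toy data the canopy-restricted pass makes 6 comparisons against 12 for the all-pairs pass. That gap is the saving I expected to see in the Mahout jobs.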


Thanks
-Aaron


[1] https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
[2] http://www.kamalnigam.com/papers/canopy-kdd00.pdf
