The Mahout wiki page for canopy clustering [1] links to the KDD 2000 paper by McCallum et al. [2], but having read both the paper and the Mahout code, I don't think they are doing the same thing. In the paper, multiple seed prototypes are placed in each canopy, and each point is compared only to the prototypes that share a canopy with it. Avoiding the comparison of every point with every prototype is the whole point of the method.

The two uses of canopy that I've looked at in the Mahout code are org.apache.mahout.clustering.syntheticcontrol.canopy.Job and org.apache.mahout.clustering.syntheticcontrol.kmeans.Job. As I understand it, both just use the canopy centroids as seed prototypes for a classical clustering approach, in which every point is compared with every prototype in each iteration. Is that accurate, or have I misread the code?
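
To make the distinction concrete, here is a rough Python sketch of the two assignment steps as I understand them. It is not Mahout code and not code from the paper: the function names, the toy data, the thresholds, and the "cheap" metric (distance along the first coordinate only) are all assumptions I made up for illustration.

```python
import math


def cheap_distance(a, b):
    # Assumed cheap metric for illustration: first coordinate only.
    return abs(a[0] - b[0])


def full_distance(a, b):
    # The expensive metric: ordinary Euclidean distance.
    return math.dist(a, b)


def build_canopies(points, t1, t2):
    """Build overlapping canopies (sets of point indices); requires t1 >= t2 >= 0."""
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = points[remaining[0]]
        # Every point within the loose threshold t1 joins this canopy.
        canopies.append(
            {i for i, p in enumerate(points) if cheap_distance(center, p) <= t1}
        )
        # Points within the tight threshold t2 can no longer start a canopy.
        remaining = [i for i in remaining if cheap_distance(center, points[i]) > t2]
    return canopies


def assign_within_canopies(points, prototype_ids, canopies):
    """Compare each point only with prototypes that share a canopy with it."""
    member_of = [
        {c for c, canopy in enumerate(canopies) if i in canopy}
        for i in range(len(points))
    ]
    assignments, comparisons = [], 0
    for i, p in enumerate(points):
        best, best_d = None, math.inf
        for k in prototype_ids:
            if member_of[i] & member_of[k]:  # at least one canopy in common
                comparisons += 1
                d = full_distance(p, points[k])
                if d < best_d:
                    best, best_d = k, d
        assignments.append(best)
    return assignments, comparisons


def assign_all_pairs(points, prototype_ids):
    """Classical assignment: compare every point with every prototype."""
    assignments, comparisons = [], 0
    for p in points:
        best, best_d = None, math.inf
        for k in prototype_ids:
            comparisons += 1
            d = full_distance(p, points[k])
            if d < best_d:
                best, best_d = k, d
        assignments.append(best)
    return assignments, comparisons


# Toy data: two well-separated groups, with two prototypes seeded in each.
POINTS = [(0.0, 0.0), (0.4, 0.2), (1.0, 0.0), (10.0, 0.0), (10.4, 0.3), (11.0, 0.0)]
PROTOTYPES = [0, 2, 3, 5]  # indices into POINTS used as seed prototypes

canopies = build_canopies(POINTS, t1=2.0, t2=1.0)
print(assign_within_canopies(POINTS, PROTOTYPES, canopies))
print(assign_all_pairs(POINTS, PROTOTYPES))
```

On this toy data both functions give the same assignments, but the canopy-restricted version makes 12 full-distance comparisons against 24 for the all-pairs version. My reading is that the paper does the former, while the two Job classes do the latter after using canopies only to pick the seeds.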

Thanks
-Aaron

[1] https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering
[2] http://www.kamalnigam.com/papers/canopy-kdd00.pdf
