-----Original Message-----
From: Pere Ferrera [mailto:[email protected]]
Sent: Wednesday, January 05, 2011 4:37 AM
To: [email protected]
Subject: two-tiered clustering
Hello all,

I have a clustering problem that involves clustering a huge data set.

[jeff] I'm having some similar problems with my own data, so let's collaborate to see if we can improve things.

My knowledge of clustering techniques is very limited, so I just tried a few algorithms. Because I can regression-test them against a valid clustered set, I found a configuration that gives the best accuracy with no false positives (MeanShift Canopy, Tanimoto distance). However, because the number of output clusters is roughly linear in the number of documents to cluster, I had problems scaling out with MeanShift because of the single Reducer step. I have read on the mailing list that MeanShift is a good fit when the number of clusters grows logarithmically, not linearly, with input size.

[jeff] I keep being surprised that MeanShift Canopy performs as well as it does; I'm pretty sure there is no other implementation like it anywhere. I think one can increase the number of reducers early in the iterations to overcome some of the scalability limitations you cite, but the canopies gather a list of all their contained pointIds, and that will run into another scalability issue. There are ways, with logging, to eliminate this memory growth, but I feel it is likely just whack-a-mole.

Then I considered K-Means because it seems to scale well regardless of the number of clusters. With a good choice of K, I got some false positives but very good coverage. To avoid false positives, I have read that I would need to choose a good initial configuration, but I can't apply Canopy beforehand because it won't scale with a large number of clusters, and K-means++ is not implemented yet.

[jeff] K-Means works well too, but we lack a good way to prime it with -k clusters. Canopy has scalability problems as you note, and the random sampling works only so-so. We have a JIRA for K-means++.

Then I thought I could apply a multi-level technique: use K-Means with an approximate, pessimistic (low) number of clusters, then use MeanShift inside each cluster to refine the results and avoid false positives. I didn't know if that was a good idea, but I found a reference to something similar (two-tiered clustering) on page 215 of "Mahout in Action". Even though I have this solution working fine, I have the feeling that it got too complex. To try it, I wrote some custom code: in my current solution I have a job that takes the output of a clustering, launches as many map tasks as clusters were generated, performs a second clustering inside each of them, and accumulates the output of each mapper (so, no Reducer step). I was wondering if there is a better way to accomplish that. Is there something already implemented to execute a two-tiered (or n-tiered) clustering? Maybe you can also suggest a better way to solve my problem.

[jeff] Nothing is implemented out of the box for n-tiered clustering, though I believe the results of any clustering can be used as the prior for most of our iterative algorithms. Another way to approach the problem might be to try Dirichlet clustering to prime k-means. It is non-parametric and could give you an objective opinion on the best value of k for your data. Dirichlet can use DistanceMeasureModels, e.g. Tanimoto, Cosine, Manhattan, et al.

Thanks,
Pere.
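
For reference, a minimal sketch of the Tanimoto (extended Jaccard) distance that the MeanShift Canopy configuration above relies on, written against plain double[] vectors rather than Mahout's Vector type (Mahout ships its own TanimotoDistanceMeasure; this only illustrates the formula):

    // Tanimoto distance = 1 - (a.b) / (|a|^2 + |b|^2 - a.b)
    public final class TanimotoDistance {

      public static double distance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          normA += a[i] * a[i];
          normB += b[i] * b[i];
        }
        double denominator = normA + normB - dot;
        if (denominator == 0.0) {
          return 0.0; // both vectors are all zeros
        }
        return 1.0 - dot / denominator;
      }
    }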
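
On the K-means++ point: a small sketch of the k-means++ seeding step itself, on generic double[] points with squared Euclidean distance (this is only the textbook algorithm, not whatever the Mahout JIRA ends up implementing):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // k-means++ seeding: each new center is sampled with probability
    // proportional to its squared distance to the nearest center chosen so far.
    public final class KMeansPlusPlusSeeder {

      public static List<double[]> chooseSeeds(List<double[]> points, int k, Random rng) {
        List<double[]> centers = new ArrayList<>();
        centers.add(points.get(rng.nextInt(points.size()))); // first center: uniform at random
        double[] d2 = new double[points.size()];
        while (centers.size() < k) {
          double total = 0.0;
          for (int i = 0; i < points.size(); i++) {
            d2[i] = nearestSquaredDistance(points.get(i), centers);
            total += d2[i];
          }
          double r = rng.nextDouble() * total; // sample index proportionally to d2
          int chosen = 0;
          for (double cumulative = 0.0; chosen < points.size() - 1; chosen++) {
            cumulative += d2[chosen];
            if (cumulative >= r) {
              break;
            }
          }
          centers.add(points.get(chosen));
        }
        return centers;
      }

      private static double nearestSquaredDistance(double[] p, List<double[]> centers) {
        double best = Double.MAX_VALUE;
        for (double[] c : centers) {
          double sum = 0.0;
          for (int i = 0; i < p.length; i++) {
            double diff = p[i] - c[i];
            sum += diff * diff;
          }
          best = Math.min(best, sum);
        }
        return best;
      }
    }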
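
And a rough sketch of the second-tier job described above, assuming the first-tier output is partitioned so that each map task sees exactly the points of one coarse cluster; the mapper buffers its points and runs an in-memory refinement in cleanup(), with no reducer. SecondTierMapper and refineCluster() are hypothetical names, and the in-memory MeanShift call is left as a placeholder:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SecondTierMapper extends Mapper<IntWritable, Text, IntWritable, Text> {

      private final List<Text> points = new ArrayList<>();
      private IntWritable coarseClusterId;

      @Override
      protected void map(IntWritable clusterId, Text point, Context context) {
        coarseClusterId = new IntWritable(clusterId.get());
        points.add(new Text(point)); // buffer this coarse cluster's points
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        if (points.isEmpty()) {
          return;
        }
        // Run the second, in-memory clustering over the buffered points and emit
        // the refined results directly (the job is configured with zero reducers).
        for (Text refined : refineCluster(points)) {
          context.write(coarseClusterId, refined);
        }
      }

      private List<Text> refineCluster(List<Text> clusterPoints) {
        // Placeholder hook: plug in MeanShift (or any in-memory algorithm) here.
        return clusterPoints;
      }
    }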
