Hello all,

I have a problem that involves clustering a huge data set. My knowledge of clustering techniques is limited, so I simply tried a few algorithms. Since I can regression-test them against a known-good clustered set, I found a configuration that gives the best accuracy with no false positives: MeanShift seeded with Canopy, using the Tanimoto distance. However, because the number of output clusters grows roughly linearly with the number of documents, I had trouble scaling MeanShift out, due to its single-Reducer step. I have read on the mailing list that MeanShift is a good fit when the number of clusters grows logarithmically with input size, not linearly.

I then considered K-Means, since it seems to scale well regardless of the number of clusters. With a good choice of K I got very good coverage, but some false positives. To avoid the false positives I would apparently need a good initial configuration, but I can't run Canopy first (it won't scale to this many clusters), and K-Means++ isn't implemented yet.

So I tried a multi-level technique: run K-Means with an approximate, pessimistic (low) number of clusters, then run MeanShift inside each resulting cluster to refine the results and eliminate the false positives. I wasn't sure this was a sound idea, but I found a reference to something similar in "Mahout in Action" (two-tiered clustering, page 215).

This solution works fine, but I have the feeling it got too complex. To try it, I wrote some custom code: a job that takes the output of a clustering run and launches as many map tasks as clusters were generated, performs a second clustering inside each one, and accumulates the output of each mapper (so there is no Reducer step).

I was wondering if there is a better way to accomplish this. Is there something already implemented to execute a two-tiered (or n-tiered) clustering? Or maybe someone can suggest a better approach to the problem altogether.

Thanks,
Pere.
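To make the two-tier idea concrete, here is a rough stdlib-only Python sketch of the structure (this is not my Hadoop/Mahout code — the k-means init is naive, the `refine` step is just a crude distance-threshold grouping standing in for the second-tier MeanShift pass, and all names and the threshold `t` are illustrative):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(pts):
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with naive first-k seeding; returns
    (centroid, members) pairs. Stand-in for the coarse K-Means pass."""
    centroids = [points[i] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist(p, centroids[j]))
            groups[i].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return list(zip(centroids, groups))

def refine(points, t):
    """Greedy threshold grouping inside one coarse cluster -- a cheap
    stand-in for the second-tier MeanShift pass: each point joins the
    first subcluster whose seed is within t, else starts a new one."""
    subclusters = []
    for p in points:
        for seed, members in subclusters:
            if dist(p, seed) <= t:
                members.append(p)
                break
        else:
            subclusters.append((p, [p]))
    return subclusters

def two_tier(points, k, t):
    """Tier 1: coarse k-means with a pessimistic (low) k.
    Tier 2: refine each coarse cluster independently -- this is the part
    that runs as one map task per cluster, with no Reducer."""
    final = []
    for _centroid, members in kmeans(points, k):
        if members:
            final.extend(refine(members, t))
    return final
```

The point of the shape is that tier 2 touches each coarse cluster in isolation, which is why it parallelizes as map-only tasks whose outputs can simply be concatenated.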
