On Jun 17, 2009, at 9:05 AM, Benson Margulies wrote:

All I know is what I learned from reading the paper. However, I continue to think, from reading the paper, that you may be trying to make Canopy do
something it was not intended to do.

As I read the paper, the idea here is to get a rough partitioning that is used to optimize various downstream algorithms, not to tune for a precise
partitioning. The number of canopies doesn't need, as I read it, to be
particularly close to the number of eventual partitions to be useful.

Thus the extended discussion of how to start up and run various other
algorithms (e.g. k-means).

Makes sense.
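To make the "rough partitioning feeding a downstream algorithm" idea concrete, here is a minimal sketch in Python. It is not Mahout's implementation or API: the names canopy and kmeans are made up for illustration, it uses plain Euclidean distance where the paper uses a cheap approximate distance, and it follows the common simplified form of the algorithm in which only points still on the list can join a new canopy. T1 is the loose threshold and T2 the tight one, with T1 > T2.

```python
import math


def canopy(points, t1, t2):
    """Greedy canopy pass; returns a list of (center, members) pairs.

    Illustrative sketch only: Euclidean distance is an assumption, and
    members are drawn from the points still on the list.
    """
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        # Loose threshold T1 decides who belongs to this canopy.
        members = [p for p in remaining if math.dist(center, p) < t1]
        canopies.append((center, members))
        # Tight threshold T2 decides who can no longer start a canopy.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return canopies


def kmeans(points, centers, iterations=10):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep empty ones.
        centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
seeds = [center for center, _ in canopy(points, t1=5.0, t2=2.0)]
final_centers = kmeans(points, seeds)
print(len(seeds), final_centers)
```

The canopy pass only has to land in the right neighborhood; k-means then does the precise work, which is why the canopy count need not match the eventual number of partitions exactly.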


Now, still, you need to get some useful number of partitions. The paper has a classic toss-off line, 'we used cross-validation,' without any details about exactly what the authors did. Presumably, that means that the authors ran many possible values and hand-examined the results. The paper reports no general results about how sensitive the T values are to particular input data sets. A pessimist would fear that, for any new input, you're going to need to go through a lengthy process to find good values for T1 and T2.
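A rough sketch of what "ran many possible values" could look like in practice, in Python. This is a guess at the kind of sweep involved, not what the authors did: count_canopies is a made-up helper, the data is synthetic, and distance is plain Euclidean. In this simplified form of the algorithm only T2 controls how many canopies come out (T1 only affects membership), so the sweep varies T2 alone.

```python
import math
import random


def count_canopies(points, t2):
    """Number of canopies produced by a greedy pass with tight threshold t2.

    Illustrative helper: each new center removes every point closer than
    t2 from the list, so the count depends on t2 and on point order.
    """
    if t2 <= 0:
        raise ValueError("expected t2 > 0")
    remaining = list(points)
    count = 0
    while remaining:
        center = remaining[0]
        count += 1
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return count


# Synthetic data (an assumption for the demo): three Gaussian blobs.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (8.0, 0.0), (4.0, 7.0)]
data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
        for cx, cy in blob_centers for _ in range(50)]

# Sweep T2 and eyeball how the canopy count responds.
for t2 in (0.5, 1.0, 2.0, 4.0, 8.0, 16.0):
    print(f"T2={t2:5.1f} -> {count_canopies(data, t2)} canopies")
```

A sweep like this only shows how the count moves with T2 on one data set; it says nothing about whether the same values carry over to new input, which is exactly the pessimist's worry.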

This leads me to wonder, ignorantly, why this project is so focused on
Canopy. The paper describes it as a tool for speeding up various other
things. Since you're hadooping all those other things, how much does it
help?

I don't think anyone is solely focused on it, but it is something that we have available in our arsenal of clustering tools, so it warrants documentation and an understanding of when and how to use it. Personally, it's just something I could easily run to work on MAHOUT-121.

At any rate, this kind of write-up is exactly the advice that we need to be able to give people. Care to add to http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData ?



Anyway, I expect that my ignorance is on comprehensive display here.


Funny, I feel like my ignorance is the one on display, but that is something I got over a long time ago in open source. Which is why I just come out and ask the questions! One of my goals for Mahout is to make it a place where people can come and learn about Machine Learning and get practical advice and not be afraid to ask basic questions. Machine learning is so shrouded in mystery it almost seems like a Dark Art. I'm thankful every day on this project that smarter people than me show up and answer questions. So, please, keep 'em coming!

-Grant
