On Jun 17, 2009, at 9:05 AM, Benson Margulies wrote:
All I know is what I learned from reading the paper. However, I continue
to think, from reading the paper, that you may be trying to make Canopy
do something it was not intended to do.
As I read the paper, the idea here is to get a rough partitioning that
is used to optimize various downstream algorithms, not to tune for a
precise partitioning. The number of canopies doesn't need, as I read it,
to be particularly close to the number of eventual partitions to be
useful. Thus the extended discussion of how to start up and run various
other algorithms (e.g. k-means).
Makes sense.
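For anyone following along without the paper to hand, here is a minimal Python sketch of the rough partitioning being discussed. It is an illustration under stated assumptions, not Mahout's implementation: the function name `canopies`, the toy points, and the use of exact Euclidean distance (the paper uses a cheap approximate distance) are all made up for this example, and it measures each center only against points still on the list.

```python
import math

def canopies(points, t1, t2, dist=math.dist):
    """Group points into overlapping canopies (loose threshold t1 > tight threshold t2)."""
    if t1 <= t2:
        raise ValueError("expected t1 > t2")
    remaining = list(points)
    result = []
    while remaining:
        # The paper picks a point off the list; taking the first one
        # keeps this sketch deterministic.
        center = remaining[0]
        # Everything within the loose threshold joins this canopy.
        members = [p for p in remaining if dist(center, p) <= t1]
        result.append((center, members))
        # Everything within the tight threshold is removed from the list,
        # so it can never become the center of a later canopy.
        remaining = [p for p in remaining if dist(center, p) > t2]
    return result

# Hypothetical toy data on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (5.0, 0.0), (6.0, 0.0)]
result = canopies(points, t1=3.0, t2=1.5)
for center, members in result:
    print(center, members)
```

Points that fall between the two thresholds stay on the list, which is why one point can end up in more than one canopy. A downstream algorithm such as k-means would then only need to compare points that share a canopy.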
Now, still, you need to get some useful number of partitions. The paper
has a classic toss-off line, 'we used cross-validation,' without any
details about exactly what the authors did. Presumably, that means that
the authors ran many possible values and hand-examined the results. The
paper reports no general results about how sensitive the T values are to
particular input data sets. A pessimist would fear that, for any new
input, you're going to need to go through a lengthy process to find good
values for T1 and T2.
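To make the sensitivity concern concrete, here is a hedged Python sketch of what "running many possible values and examining the results" might look like. It is not the procedure the paper's authors used (the paper gives no details). The helper `canopy_stats`, the toy points, and the choice to fix T1 at twice T2 are assumptions made purely for illustration.

```python
import math

def canopy_stats(points, t1, t2):
    """Return (number of canopies, mean canopy size); assumes points is non-empty and t1 > t2."""
    remaining = list(points)
    sizes = []
    while remaining:
        center = remaining[0]
        # Canopy size: points still on the list within the loose threshold.
        sizes.append(sum(1 for p in remaining if math.dist(center, p) <= t1))
        # Drop points within the tight threshold before picking the next center.
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return len(sizes), sum(sizes) / len(sizes)

# Made-up toy data: two tight groups far apart.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0),
          (10.0, 0.0), (10.5, 0.0), (11.0, 0.0)]

# Sweep T2 with T1 = 2 * T2 (an arbitrary choice for this sketch).
sweep = {t2: canopy_stats(points, 2 * t2, t2) for t2 in (0.25, 0.75, 2.0, 20.0)}
for t2, (n, mean_size) in sweep.items():
    print(f"T2={t2:<5} T1={2 * t2:<5} canopies={n} mean size={mean_size:.2f}")
```

In this formulation the number of canopies is driven by T2, since only T2 removes points from the list, while T1 controls how large and how overlapping the canopies are. Even on six points the canopy count swings from one per point to a single canopy, which is the pessimist's worry in miniature.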
This leads me to wonder, ignorantly, why this project is so focused on
Canopy. The paper describes it as a tool for speeding up various other
things. Since you're hadooping all those other things, how much does it
help?
I don't think anyone is solely focused on it, but it is something that
we have available in our arsenal of clustering tools, therefore it
warrants documentation and understanding of when and how to use it.
Personally, it's just something I could easily run to work on
MAHOUT-121.
At any rate, this kind of write up is exactly the advice that we need
to be able to give people. Care to add to
http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData ?
Anyway, I expect that my ignorance is on comprehensive display here.
Funny, I feel like my ignorance is the one on display, but that is
something I got over a long time ago in open source. Which is why I
just come out and ask the questions! One of my goals for Mahout is to
make it a place where people can come and learn about Machine Learning
and get practical advice and not be afraid to ask basic questions.
Machine learning is so shrouded in mystery it almost seems like a Dark
Art. I'm thankful every day on this project that smarter people than
me show up and answer questions. So, please, keep 'em coming!
-Grant