Re: [Canopy] Picking t1 and t2 was Re: [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

Grant Ingersoll Wed, 17 Jun 2009 06:22:46 -0700


On Jun 17, 2009, at 9:05 AM, Benson Margulies wrote:

All I know is what I learned from reading the paper. However, I continue to
think, from reading the paper, that you may be trying to make Canopy do
something it was not intended to do.
As I read the paper, the idea here is to get a rough partitioning that is
used to optimize various downstream algorithms, not to tune for a precise
partitioning. The number of canopies doesn't need, as I read it, to be
particularly close to the number of eventual partitions to be useful.

Thus the extended discussion of how to start up and run various other
algorithms, (e.g. k-means).


Makes sense.

Now, still, you need to get some useful number of partitions. The paper has
a classic toss-off line, 'we used cross-validation,' without any details
about exactly what the authors did. Presumably, that means that the authors
ran many possible values and hand-examined the results. The paper reports no
general results about how sensitive the T values are to particular input
data sets. A pessimist would fear that, for any new input, you're going to
need to go through a lengthy process to find good values for T1 and T2.
This leads me to wonder, ignorantly, why this project is so focused on
Canopy. The paper describes it as a tool for speeding up various other
things. Since you're hadooping all those other things, how much does it
help?
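
For readers following the T1/T2 discussion, here is a minimal sketch of canopy generation with a loose threshold T1 and a tight threshold T2 (T1 > T2), run over a few threshold pairs to show how the canopy count shifts. It is an illustration only, not Mahout's implementation: the names canopy_centers and euclidean, the sample points, and the threshold pairs are all made up for this example.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t1, t2, distance):
    """Build canopies from points (hypothetical helper, not Mahout's API).

    A point within t1 of a center joins that canopy; a point within t2
    is also removed from the candidate list, so it can no longer become
    a center or join later canopies.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # take the next candidate as a center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)  # loosely bound to this canopy
            if d >= t2:
                still_remaining.append(p)  # may still join other canopies
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Made-up sample data: two well-separated groups.
points = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 10.0), (10.5, 10.0)]

# Try a few (T1, T2) pairs and report how many canopies each produces.
for t1, t2 in [(3.0, 1.0), (1.0, 0.4), (20.0, 15.0)]:
    count = len(canopy_centers(points, t1, t2, euclidean))
    print(f"T1={t1} T2={t2} -> {count} canopies")
```

Even on five points the canopy count changes with each threshold pair, which is the sensitivity the paragraph above worries about. A sweep like this is cheap on a small sample, so it can serve as a rough first pass before committing to values for a full run.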

I don't think anyone is solely focused on it, but it is something that we have available in our arsenal of clustering tools, therefore it warrants documentation and an understanding of when and how to use it. Personally, it's just something I could easily run to work on MAHOUT-121.

At any rate, this kind of write-up is exactly the advice that we need to be able to give people. Care to add to http://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData?


Anyway, I expect that my ignorance is on comprehensive display here.

Funny, I feel like my ignorance is the one on display, but that is something I got over a long time ago in open source. Which is why I just come out and ask the questions! One of my goals for Mahout is to make it a place where people can come and learn about Machine Learning and get practical advice and not be afraid to ask basic questions. Machine learning is so shrouded in mystery it almost seems like a Dark Art. I'm thankful every day on this project that smarter people than me show up and answer questions. So, please, keep 'em coming!


-Grant
