Check out the patch I just put up on M-138
On Jun 26, 2009, at 12:32 PM, Jeff Eastman wrote:
Grant Ingersoll wrote:
Isn't the KMeansJob pretty much redundant, assuming we add a
parameter to KMeansDriver to take in the number of reduce tasks?
The purpose of the clustering jobs, in general, was to simplify
computing the clusters and then clustering the data. It has been
applied - and changed - inconsistently over the various
implementations and some cleanup is warranted. It seems to me that
having a job to do both steps is still valuable, though (as in the
earlier kmeans synthetic control example) it may do the point
clustering unnecessarily if it is blindly used as the only entry point.
I don't currently see how specifying the 'k' value explicitly can
work in the current job, and it is unrelated to the number of
reducers. The 'k' value comes from the initial number of clusters. I
think the implementation can use any number of reducers up to 'k'
but don't recall seeing a test for that. One could add a job step
that picks 'k' random centers from the data - as in your previous
threads - and that job/driver would need to know 'k'. See below.
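A single-pass way to pick those 'k' random centers could look like the
sketch below. This is purely illustrative (the class and method names are
hypothetical, not Mahout's actual API); reservoir sampling keeps each input
point equally likely to be chosen, which fits a streaming map-reduce step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: seed the initial clusters by selecting 'k' random
// points from the data, as an alternative to running Canopy first.
public class RandomSeedSketch {
  // Reservoir sampling: one pass, each point chosen with probability k/n.
  static <T> List<T> sampleK(Iterable<T> points, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    int seen = 0; // points processed so far
    for (T point : points) {
      if (reservoir.size() < k) {
        reservoir.add(point);            // fill reservoir with first k points
      } else {
        int j = rng.nextInt(seen + 1);   // uniform over 0..seen
        if (j < k) {
          reservoir.set(j, point);       // replace with decreasing probability
        }
      }
      seen++;
    }
    return reservoir;
  }
}
```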
For consistency, it seems to me that all the clustering jobs should
uniformly facilitate these actions:
0. Set the initial clustering state
1. Compute a set of clusters given the input data points and the
initial clustering state
2. Optionally cluster the input data points by assigning them to
clusters. This would be with probabilities in the case of
FuzzyKMeans and Dirichlet or one might just desire the most likely
cluster.
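Step 2 in the most-likely-cluster case amounts to a nearest-center
assignment, roughly like the sketch below (illustrative names only, not
Mahout code; FuzzyKMeans and Dirichlet would instead emit a probability
per cluster):

```java
// Hypothetical sketch of step 2 for plain KMeans: assign each point to
// its single most likely (nearest) cluster center.
public class NearestClusterSketch {
  static double distanceSquared(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Returns, for each point, the index of its nearest center.
  static int[] assign(double[][] points, double[][] centers) {
    int[] labels = new int[points.length];
    for (int p = 0; p < points.length; p++) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centers.length; c++) {
        double d = distanceSquared(points[p], centers[c]);
        if (d < bestDist) {
          bestDist = d;
          best = c;
        }
      }
      labels[p] = best;
    }
    return labels;
  }
}
```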
Canopy has no initial clustering state. For KMeans, this can be
computed via running Canopy on the data or by selecting 'k' random
points from the data, or by some other heuristic (un)related to the
data. For Dirichlet, it is by sampling from the prior of the
ModelDistribution; for MeanShift every input data point creates an
initial canopy.
(The various jobs, drivers and output directory structures produced
by the different algorithms need to be cleaned up and made more
consistent, IMO)
Also, isn't the variable named for the number of reduce tasks
(numCentroids) in KMeansJob actually the "k" in k-Means, even though
this value is currently fixed at 2 when using KMeansDriver? I'm trying
to make arg handling easier for MAHOUT-138.
I thought I had already committed a change to rename this argument to
numReduceTasks so as to be consistent with its application in
KMeansDriver.
Jeff