Check out the patch I just put up on M-138
On Jun 26, 2009, at 12:32 PM, Jeff Eastman wrote:
Grant Ingersoll wrote:
Isn't the KMeansJob pretty much redundant, assuming we add a
parameter to KMeansDriver to take in the number of reduce tasks?
The purpose of the clustering jobs, in general, was to simplify
computing the clusters and then clustering the data. It has been
applied - and changed - inconsistently over the various
implementations and some cleanup is warranted. It seems to me that
having a job to do both steps is still valuable, though (as in the
earlier kmeans synthetic control example) it may do the point
clustering unnecessarily if it is blindly used as the only entry point.
I don't currently see how specifying the 'k' value explicitly can
work in the current job, and it is unrelated to the number of
reducers. The 'k' value comes from the initial number of clusters. I
think the implementation can use any number of reducers up to 'k'
but don't recall seeing a test for that. One could add a job step
that picks 'k' random centers from the data - as in your previous
threads - and that job/driver would need to know 'k'. See below.
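A single-pass way to pick those 'k' random centers could look like the
sketch below. This is purely illustrative (the class and method names are
hypothetical, not Mahout's actual API); reservoir sampling keeps each input
point equally likely to be chosen, which fits a streaming map-reduce step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: seed the initial clusters by selecting 'k' random
// points from the data, as an alternative to running Canopy first.
public class RandomSeedSketch {
  // Reservoir sampling: one pass, each point chosen with probability k/n.
  static <T> List<T> sampleK(Iterable<T> points, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    int seen = 0; // points processed so far
    for (T point : points) {
      if (reservoir.size() < k) {
        reservoir.add(point);            // fill reservoir with first k points
      } else {
        int j = rng.nextInt(seen + 1);   // uniform over 0..seen
        if (j < k) {
          reservoir.set(j, point);       // replace with decreasing probability
        }
      }
      seen++;
    }
    return reservoir;
  }
}
```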
For consistency, it seems to me that all the clustering jobs should
uniformly facilitate these actions:
0. Set the initial clustering state
1. Compute a set of clusters given the input data points and the
initial clustering state
2. Optionally cluster the input data points by assigning them to
clusters. This would be with probabilities in the case of
FuzzyKMeans and Dirichlet or one might just desire the most likely
cluster.
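Step 2 in the most-likely-cluster case amounts to a nearest-center
assignment, roughly like the sketch below (illustrative names only, not
Mahout code; FuzzyKMeans and Dirichlet would instead emit a probability
per cluster):

```java
// Hypothetical sketch of step 2 for plain KMeans: assign each point to
// its single most likely (nearest) cluster center.
public class NearestClusterSketch {
  static double distanceSquared(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Returns, for each point, the index of its nearest center.
  static int[] assign(double[][] points, double[][] centers) {
    int[] labels = new int[points.length];
    for (int p = 0; p < points.length; p++) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centers.length; c++) {
        double d = distanceSquared(points[p], centers[c]);
        if (d < bestDist) {
          bestDist = d;
          best = c;
        }
      }
      labels[p] = best;
    }
    return labels;
  }
}
```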
Canopy has no initial clustering state. For KMeans, this can be
computed via running Canopy on the data or by selecting 'k' random
points from the data, or by some other heuristic (un)related to the
data. For Dirichlet, it is by sampling from the prior of the
ModelDistribution; for MeanShift every input data point creates an
initial canopy.
(The various jobs, drivers and output directory structures produced
by the different algorithms need to be cleaned up and made more
consistent, IMO)
Also, isn't the variable named for the number of reduce tasks
(numCentroids) in KMeansJob actually the "k" in k-Means, even though
this value is currently fixed at 2 when using KMeansDriver? I'm trying
to make arg handling easier for MAHOUT-138.
I thought I had already committed a change to rename this argument to
numReduceTasks so as to be consistent with its application in
KMeansDriver.
Jeff