Hi,

in Mahout 0.8 I see that ClusterOutputPostProcessorMapper and ClusterOutputPostProcessorReducer both use Map<Integer, Integer> *ClusterMappings = ClusterCountReader.getClusterIDs(clusterOutputPath, conf, <true|false>).

This map translates cluster IDs to an index from 0 to k-1, where k is the number of clusters.

What is the purpose of this mapping?

The cluster IDs themselves are ints, so mapping them to an index (and mapping back from the index in the Reducer) seems useless to me.

Since clusterpp sets the number of reducers equal to k, I initially thought this design was meant to ensure that each cluster goes to a separate reducer, but that should be true even without the mapping.

With the mapping, the reducer gets keys like this: 0, 1, 2, 3, 4, 5, 6, ... Without the mapping, the reducer gets keys like this: 345, 37636, 14, 47699, 234576, ...

But either way, the clustered points will still be shuffled by cluster ID when passed to the reducer.
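
To make the comparison concrete, here is a small sketch of how the two kinds of keys would be spread over k reducers. It assumes the job uses Hadoop's default HashPartitioner with IntWritable keys, which I have not verified for clusterpp. The class and method names (PartitionSketch, partitionFor, assign) are made up for illustration and are not part of Mahout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionSketch {

    // Mirrors the formula of Hadoop's default HashPartitioner:
    //   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    // For an IntWritable key, hashCode() is the wrapped int value itself.
    // ASSUMPTION: clusterpp relies on this default partitioner.
    static int partitionFor(int key, int numReducers) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numReducers;
    }

    // Returns key -> reducer number for every key, in input order.
    static Map<Integer, Integer> assign(int[] keys, int numReducers) {
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int key : keys) {
            result.put(key, partitionFor(key, numReducers));
        }
        return result;
    }

    public static void main(String[] args) {
        int[] rawIds = {345, 37636, 14, 47699, 234576}; // example IDs from above
        int[] indices = {0, 1, 2, 3, 4};                // the same clusters after mapping
        int k = rawIds.length;                          // one reducer per cluster

        System.out.println("raw ids -> reducer: " + assign(rawIds, k));
        System.out.println("indices -> reducer: " + assign(indices, k));
    }
}
```

If that assumption holds, distinct raw IDs can end up on the same reducer, while the indices 0 to k-1 each get a reducer of their own. Whether this is the actual reason for the mapping is what I would like to have confirmed.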

So what gives?

Thank you, guys, for your hints
reinis.
