ClusterOutputPostProcessor: what is the purpose of clusterMappings

Reinis Vicups Wed, 14 May 2014 05:47:28 -0700

Hi,

In Mahout 0.8 I see that ClusterOutputPostProcessorMapper and -Reducer are using Map<Integer, Integer> clusterMappings = ClusterCountReader.getClusterIDs(clusterOutputPath, conf, <true|false>).

This map allows mapping clusterIds to an index from 0 to k-1, where k is the number of clusters.
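To make the mapping concrete, here is a minimal sketch of a clusterId-to-index map and its inverse. It does not call Mahout or read any cluster output: the class ClusterIndexSketch and the helpers toIndex and invert are hypothetical names, and assigning indexes in encounter order is an assumption, not necessarily what ClusterCountReader does.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterIndexSketch {

    // Hypothetical helper: gives each distinct cluster id an index 0..k-1
    // in the order the ids are encountered (an assumption for illustration).
    static Map<Integer, Integer> toIndex(List<Integer> clusterIds) {
        Map<Integer, Integer> mapping = new LinkedHashMap<>();
        for (Integer id : clusterIds) {
            mapping.putIfAbsent(id, mapping.size());
        }
        return mapping;
    }

    // Hypothetical helper: swaps keys and values (index -> cluster id),
    // which is what a reducer would need to recover the original id.
    static Map<Integer, Integer> invert(Map<Integer, Integer> mapping) {
        Map<Integer, Integer> inverse = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : mapping.entrySet()) {
            inverse.put(e.getValue(), e.getKey());
        }
        return inverse;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(345, 37636, 14, 47699, 234576);
        Map<Integer, Integer> idToIndex = toIndex(ids);
        Map<Integer, Integer> indexToId = invert(idToIndex);
        System.out.println(idToIndex); // {345=0, 37636=1, 14=2, 47699=3, 234576=4}
        System.out.println(indexToId); // {0=345, 1=37636, 2=14, 3=47699, 4=234576}
    }
}
```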


What is the purpose of this mapping?

The clusterIds themselves are ints, so the mapping to an index (and the reverse mapping in the Reducer back from the index) seems useless to me.

Since clusterpp sets the number of reducers equal to k, I initially thought this design was meant to ensure that each cluster is given to a separate reducer, but this should be true even without the mapping.

What the reducer gets as a key IF we are doing the mapping is this: 0, 1, 2, 3, 4, 5, 6, ...
Without the mapping the reducer gets keys like this: 345, 37636, 14, 47699, 234576, ...

But the clustered points will still be shuffled by cluster id when passed to the reducer.
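One way to probe that assumption is to compute which reducer each key would land on. The sketch below is plain Java with no Hadoop dependency. It assumes the job uses Hadoop's default HashPartitioner, whose rule is (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, and that the key is an IntWritable, whose hash code is the int value itself. Whether clusterpp actually relies on the default partitioner is not verified here. The class name PartitionSketch, the choice of k = 5, and the use of only the five example ids listed above are assumptions made for illustration.

```java
public class PartitionSketch {

    // Reproduces the default hash-partitioning rule for an int key
    // (assumes the key's hash code equals its int value).
    static int partition(int key, int numReducers) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int k = 5; // hypothetical number of clusters and reducers

        // Raw cluster ids as keys.
        int[] rawIds = {345, 37636, 14, 47699, 234576};
        for (int id : rawIds) {
            System.out.println("id " + id + " -> reducer " + partition(id, k));
        }

        // Mapped indexes 0..k-1 as keys.
        for (int index = 0; index < k; index++) {
            System.out.println("index " + index + " -> reducer " + partition(index, k));
        }
    }
}
```

Under these assumptions the raw ids land on reducers 0, 1, 4, 4, 1, so two pairs of clusters share a reducer and reducers 2 and 3 receive nothing, while the indexes 0 to 4 each land on their own reducer.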


So what gives?

Thank you, guys, for your hints
reinis.
