Hi Neil,

I had a similar problem and managed to solve it. See http://comments.gmane.org/gmane.comp.apache.mahout.user/10228.
R

________________________________
From: Neil Chaudhuri <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Tuesday, 6 December 2011, 0:02
Subject: MeanShiftCanopyDriver Output

I am attempting to programmatically run MeanShiftCanopyDriver. I found this note about the output:

After running the algorithm, the output directory will contain:
1. clusters-N: directories containing SequenceFiles(Text, MeanShiftCanopy) produced by the algorithm for each iteration. The Text key is a cluster identifier string.
2. clusteredPoints: (if runClustering enabled) a directory containing SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable key is the clusterId. The WeightedVectorWritable value is a bean containing a double weight and a VectorWritable vector where the weight indicates the probability that the vector is a member of the cluster. As Mean Shift only produces a single clustering for each point, the weights are all == 1.

It seems like I can only expect to find a sequence file of clusterIds mapped to Vectors. I am lost as to where I can find a reference (perhaps an id) to the original documents being clustered. In other words, how can I map the output back to the input?

Thanks.
