[ https://issues.apache.org/jira/browse/MAHOUT-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll resolved MAHOUT-1084.
-------------------------------------

    Resolution: Fixed

Thanks liutengfei!

> Kmeans for synthetic control example--there are 12 cluster during iterations.
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-1084
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1084
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: liutengfei
>            Assignee: Grant Ingersoll
>             Fix For: 0.8
>
>
> In the Mahout k-means synthetic control example, the default parameters
> should produce 6 clusters in the end. Why, then, are there 12 clusters
> during the k-means iterations? From what I observed, the first 6 clusters
> and the last 6 clusters are identical before the first iteration; those 6
> clusters are generated by RandomSeedGenerator.java. CIMapper then assigns
> its points to these 12 clusters. Is there a logic error here?
>
> The 12 clusters are created in the "setup" function of CIMapper.java,
> specifically by the line
> "classifier.readFromSeqFiles(conf, new Path(priorClustersPath));".
> Here "priorClustersPath" is the HDFS directory "output/clusters-0/", which
> contains 8 files: "_policy", "part-randomSeed" (one file holding six
> clusters), and "part-00000" to "part-00005" (six files, each holding one
> cluster). When this directory is read, "_policy" is filtered out, so the
> program reads "part-00000" to "part-00005" to create six clusters and then
> reads "part-randomSeed" to create another six. This is why there are 12
> clusters before the first iteration.
>
> Solution: remove the code that creates the duplicate clusters in
> "output/clusters-0/". Here I commented out the code in
> ClusterClassifier.java that writes the files "part-00000" to "part-00005":
>
>   public void writeToSeqFiles(Path path) throws IOException {
>     writePolicy(policy, path);
>     /*
>     Configuration config = new Configuration();
>     FileSystem fs = FileSystem.get(path.toUri(), config);
>     SequenceFile.Writer writer = null;
>     ClusterWritable cw = new ClusterWritable();
>     for (int i = 0; i < models.size(); i++) {
>       try {
>         Cluster cluster = models.get(i);
>         cw.setValue(cluster);
>         writer = new SequenceFile.Writer(fs, config,
>             new Path(path, "part-" + String.format(Locale.ENGLISH, "%05d", i)),
>             IntWritable.class, ClusterWritable.class);
>         Writable key = new IntWritable(i);
>         writer.append(key, cw);
>       } finally {
>         Closeables.closeQuietly(writer);
>       }
>     }
>     */
>   }
>
> I don't know whether this is still OK for other programs that use this
> file, but for k-means in the synthetic control example the program now
> creates 6 clusters in every iteration, as I expected.
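
The duplication described in the issue can be illustrated without Hadoop. The sketch below is a toy model, not Mahout code: the class `PriorClustersSketch` and its methods are hypothetical names, the directory is modelled as a map from file name to the number of clusters stored in that file, and the "skip names starting with an underscore" rule is an assumption standing in for whatever filter drops "_policy". It only shows why reading every remaining file yields 12 clusters while the per-cluster part files exist, and 6 once they are no longer written.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy model of output/clusters-0/: file name -> number of clusters stored in that file. */
final class PriorClustersSketch {

    /** Assumed filter: names starting with "_" (such as "_policy") are skipped. */
    static boolean isClusterFile(String name) {
        return !name.startsWith("_");
    }

    /** Counts the clusters a reader would load if it reads every non-filtered file. */
    static int countLoadedClusters(Map<String, Integer> dir) {
        int total = 0;
        for (Map.Entry<String, Integer> entry : dir.entrySet()) {
            if (isClusterFile(entry.getKey())) {
                total += entry.getValue();
            }
        }
        return total;
    }

    /**
     * Builds the modelled directory for k seed clusters.
     * writePartFiles = true models the behaviour reported in the issue,
     * false models the directory after the part-0000N files are no longer written.
     */
    static Map<String, Integer> buildDir(int k, boolean writePartFiles) {
        Map<String, Integer> dir = new LinkedHashMap<>();
        dir.put("_policy", 0);          // policy file, holds no clusters
        dir.put("part-randomSeed", k);  // one file holding all k seed clusters
        if (writePartFiles) {
            for (int i = 0; i < k; i++) {
                // one file per cluster: part-00000, part-00001, ...
                dir.put("part-" + String.format(Locale.ENGLISH, "%05d", i), 1);
            }
        }
        return dir;
    }

    public static void main(String[] args) {
        System.out.println(countLoadedClusters(buildDir(6, true)));   // prints 12
        System.out.println(countLoadedClusters(buildDir(6, false)));  // prints 6
    }
}
```

With the part files present the modelled directory holds 8 entries, matching the count in the report; without them only "_policy" and "part-randomSeed" remain.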