[ https://issues.apache.org/jira/browse/MAHOUT-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Eastman updated MAHOUT-1083:
---------------------------------
    Attachment: MAHOUT-1083.patch

Oh Yikes! This is another example of the Hadoop platform reusing a single
instance of the Writable every time another one is read from the input
stream. Every time iter.next() is called, the same object (cw == first) will
be returned. This does not show up in unit tests because only a single
reducer is ever used. It explains exactly the odd clustering behavior you
have observed.

Here's a patch that *may* fix the problem. I believe that ClusterWritable
will create a new instance for its value instVar even if the same
ClusterWritable is used. Please try this and see if it resolves the issue.

> CIReducer in kmeans doesn't work well
> -------------------------------------
>
>                 Key: MAHOUT-1083
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1083
>             Project: Mahout
>          Issue Type: Bug
>         Environment: hadoop-2.0.0-alpha: pseudo cluster and single node cluster
>                      hadoop-1.0.3: pseudo cluster
>                      hadoop-0.20.2: pseudo cluster
>                      mahout: mahout-0.7
>                      os: ubuntu 11.04
>                      jdk: jdk1.6.0_27
>            Reporter: liutengfei
>         Attachments: MAHOUT-1083.patch
>
>
> the function "reduce" in mahout-0.7-kmeans-CIReducer.java doesn't work well
> as it looks like.
>
>   protected void reduce(IntWritable key, Iterable<ClusterWritable> values,
>       Context context) throws IOException, InterruptedException {
>     Iterator<ClusterWritable> iter = values.iterator();
>     ClusterWritable first = null;
>     while (iter.hasNext()) {
>       ClusterWritable cw = iter.next();
>       if (first == null) {
>         first = cw;
>       } else {
>         first.getValue().observe(cw.getValue());
>       }
>     }
>     List<Cluster> models = new ArrayList<Cluster>();
>     models.add(first.getValue());
>     classifier = new ClusterClassifier(models, policy);
>     classifier.close();
>     context.write(key, first);
>   }
>
> Apparently, the variable "first" will collect all output data of maps.
> In fact, though, the value of "first" changes after the statement
> "ClusterWritable cw = iter.next();" runs, and so does the new variable "cw"!
> I don't know why, but the results show that the code behaves as if it were
> written "ClusterWritable cw = first = iter.next();".
> Is "cw" a reference to "iter"?
> Does "iter.next" just change the value of "iter" itself to the next element?
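
The instance reuse described in the comment can be reproduced without Hadoop. The sketch below is a plain-Java illustration, not the attached patch. `ReuseDemo`, `Box`, `reusingIterator`, `sumKeepingReference` and `sumWithCopy` are hypothetical names, and the iterator only imitates the way the reduce-side iterator hands back one refilled object on every call. Copying the first element before accumulating into it is one common workaround and is assumed here purely for illustration.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class ReuseDemo {

    /** Hypothetical stand-in for a mutable Writable-style value. */
    static final class Box {
        int value;
        Box(int value) { this.value = value; }
        Box copy() { return new Box(value); }
    }

    /** Iterator that refills and returns ONE shared Box, imitating the reuse described above. */
    static Iterator<Box> reusingIterator(final int[] data) {
        return new Iterator<Box>() {
            private final Box shared = new Box(0);
            private int i = 0;

            public boolean hasNext() { return i < data.length; }

            public Box next() {
                if (i >= data.length) {
                    throw new NoSuchElementException();
                }
                shared.value = data[i++]; // overwrite the same object in place
                return shared;
            }

            public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    /** Same shape as the quoted reduce(): keeps a reference to the first element. */
    static int sumKeepingReference(Iterator<Box> iter) {
        Box first = null;
        while (iter.hasNext()) {
            Box cw = iter.next();
            if (first == null) {
                first = cw;              // aliases the shared instance
            } else {
                first.value += cw.value; // first == cw, so this only doubles the current value
            }
        }
        return first == null ? 0 : first.value;
    }

    /** Workaround sketch: take a private copy of the first element before accumulating. */
    static int sumWithCopy(Iterator<Box> iter) {
        Box first = null;
        while (iter.hasNext()) {
            Box cw = iter.next();
            if (first == null) {
                first = cw.copy();       // no longer tied to the shared instance
            } else {
                first.value += cw.value;
            }
        }
        return first == null ? 0 : first.value;
    }

    public static void main(String[] args) {
        int[] data = {5, 1, 2};
        System.out.println(sumKeepingReference(reusingIterator(data))); // 4, not the intended 8
        System.out.println(sumWithCopy(reusingIterator(data)));         // 8
    }
}
```

With the aliasing version, everything accumulated so far is overwritten on each call to `next()`, so the final result depends only on the last element.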