[ https://issues.apache.org/jira/browse/MAHOUT-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617273#comment-13617273 ]
Dan Filimon commented on MAHOUT-1181:
-------------------------------------

Sure! I was going to ask how anyone can review changes with no syntax highlighting or anything.
Should the base repository be mahout or mahout-git?

> Adding StreamingKMeans MapReduce classes
> ----------------------------------------
>
> Key: MAHOUT-1181
> URL: https://issues.apache.org/jira/browse/MAHOUT-1181
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.8
> Reporter: Dan Filimon
> Attachments: MAHOUT_1181.patch, MAHOUT_1181_props.patch, MAHOUT_1181_test.patch
>
>
> This patch implements the MapReduce version of StreamingKMeans for MAHOUT-1154.
> It adds 5 new classes:
> - CentroidWritable: class representing a centroid that can be written to a SeqFile
> - StreamingKMeansDriver: class implementing AbstractJob that is the entry point to the mapreduction
> - StreamingKMeansMapper: mapper, running StreamingKMeans (see MAHOUT-1162) clustering the points one by one
> - StreamingKMeansReducer: reducer, running BallKMeans (see MAHOUT-1162) a number of times and picking the clustering with the lowest total clustering cost.
> The cost is determined by randomly splitting the incoming centroids into a "training" and "test" set, computing the centroids on the training set and the cost on the test set. The intent is to see whether the centroids actually describe the distribution of the points or not.
> - StreamingKMeansUtilMR: helper class with a method to instantiate a searcher from a Configuration.
> Additionally, there is a test class StreamingKMeansTestMR that tests the mapper, reducer and mapper and reducer together using MRUnit.
> !!!
> Since MRUnit is now a dependency, the core pom.xml file adds MRUnit as a dependency. We depend on snapshot 1.0 which is not yet released (it will be very soon), hence the updated pom.xml is not provided for now.
> !!!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
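
For readers following the reducer description in the quoted issue, here is a minimal sketch of the "train on one part, measure cost on the other, keep the cheapest run" idea. It is not the code from the patch. The class and method names are hypothetical, a plain Lloyd-style k-means stands in for BallKMeans, points are unweighted `double[]` arrays rather than Mahout centroids, and re-splitting the data on every run is an assumption (the description does not say whether the split happens once or per run).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; none of these names come from the MAHOUT-1181 patch.
final class HoldoutClusteringSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, List<double[]> centroids) {
        int best = 0;
        for (int c = 1; c < centroids.size(); c++) {
            if (squaredDistance(point, centroids.get(c))
                    < squaredDistance(point, centroids.get(best))) {
                best = c;
            }
        }
        return best;
    }

    /** Total cost: sum of squared distances from each point to its nearest centroid. */
    static double totalCost(List<double[]> points, List<double[]> centroids) {
        double cost = 0.0;
        for (double[] p : points) {
            cost += squaredDistance(p, centroids.get(nearest(p, centroids)));
        }
        return cost;
    }

    /** Plain Lloyd-style k-means, used here only as a stand-in for BallKMeans. */
    static List<double[]> kMeans(List<double[]> train, int k, int iterations, Random rng) {
        List<double[]> shuffled = new ArrayList<>(train);
        Collections.shuffle(shuffled, rng);
        List<double[]> centroids = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            centroids.add(shuffled.get(i).clone()); // seed with k random training points
        }
        int dim = train.get(0).length;
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : train) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    sums[c][d] /= counts[c];
                }
                centroids.set(c, sums[c]);
            }
        }
        return centroids;
    }

    /** Runs the clustering several times and keeps the run with the lowest held-out cost. */
    static List<double[]> bestOfRuns(List<double[]> points, int k, int runs,
                                     double testFraction, long seed) {
        int n = points.size();
        if (k < 1 || runs < 1 || n <= k) {
            throw new IllegalArgumentException("need k >= 1, runs >= 1 and more than k points");
        }
        Random rng = new Random(seed);
        // Keep at least one test point and at least k training points.
        int testSize = (int) Math.max(1, Math.min(n - k, Math.round(testFraction * n)));
        List<double[]> best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int run = 0; run < runs; run++) {
            List<double[]> shuffled = new ArrayList<>(points);
            Collections.shuffle(shuffled, rng);          // random train/test split
            List<double[]> test = shuffled.subList(0, testSize);
            List<double[]> train = shuffled.subList(testSize, n);
            List<double[]> centroids = kMeans(train, k, 10, rng);
            double cost = totalCost(test, centroids);    // cost is measured on held-out points
            if (best == null || cost < bestCost) {
                bestCost = cost;
                best = centroids;
            }
        }
        return best;
    }
}
```

Calling `HoldoutClusteringSketch.bestOfRuns(points, k, runs, 0.25, seed)` returns the `k` centroids from whichever run had the lowest cost on its held-out quarter of the data. Measuring the cost on points the run did not train on is what makes it a check of whether the centroids describe the overall distribution, as the issue description puts it.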