Guys, could anyone please take a look at this patch? I'd really like to merge it. :)
On Sat, May 11, 2013 at 10:03 AM, Dan Filimon <[email protected]> wrote:

> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10193/
>
> Review request for mahout, Ted Dunning, Jake Mannix, Sebastian
> Schelter, Suneel Marthi, and Robin Anil.
> By Dan Filimon.
>
> *Updated May 11, 2013, 7:03 a.m.*
>
> Changes
>
> Ping, please have a look at the map/reduce classes.
>
> Description
>
> This depends (loosely) on https://reviews.apache.org/r/10194/
>
> This patch implements the MapReduce version of StreamingKMeans for
> MAHOUT-1154.
>
> It adds 5 new classes:
> - CentroidWritable: class representing a centroid that can be written to a
>   SeqFile
> - StreamingKMeansDriver: class extending AbstractJob that is the entry
>   point to the MapReduce job
> - StreamingKMeansMapper: mapper, running StreamingKMeans (see MAHOUT-1162),
>   clustering the points one by one
> - StreamingKMeansReducer: reducer, running BallKMeans (see MAHOUT-1162) a
>   number of times and picking the clustering with the lowest total
>   clustering cost.
>   The cost is determined by randomly splitting the incoming centroids into a
>   "training" and a "test" set, computing the centroids on the training set
>   and the cost on the test set. The intent is to see whether the centroids
>   actually describe the distribution of the points or not.
> - StreamingKMeansUtilsMR: helper class with a method to instantiate a
>   searcher from a Configuration.
>
> Additionally, there is a test class, StreamingKMeansTestMR, that tests the
> mapper, the reducer, and the mapper and reducer together using MRUnit.
>
> !!!
> Since MRUnit is now a dependency, the core pom.xml file adds MRUnit as a
> dependency. We depend on snapshot 1.0, which is not yet released (it will be
> very soon), hence the updated pom.xml is not provided for now.
> !!!
>
> Testing
>
> See StreamingKMeansTestMR for the tests.
> These are all performed on data sampled from a "hypercube" distribution
> (there are multinormal distributions at each vertex of the cube).
> Additionally, there are ongoing tests on the 20 newsgroups data set (and
> some more are on the way).
>
> Diffs
>
> - core/src/main/java/org/apache/mahout/clustering/ClusteringUtils.java
>   (PRE-CREATION)
> - core/src/main/java/org/apache/mahout/clustering/streaming/cluster/BallKMeans.java
>   (PRE-CREATION)
> - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/CentroidWritable.java
>   (PRE-CREATION)
> - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
>   (PRE-CREATION)
> - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansMapper.java
>   (PRE-CREATION)
> - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansReducer.java
>   (PRE-CREATION)
> - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansThread.java
>   (PRE-CREATION)
> - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansUtilsMR.java
>   (PRE-CREATION)
> - core/src/test/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansTestMR.java
>   (PRE-CREATION)
> - src/conf/driver.classes.default.props (ac45eef)
>
> View Diff <https://reviews.apache.org/r/10193/diff/>
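For anyone skimming before opening the diff, the selection step the reducer description talks about (several runs, a random training/test split per run, keep the clustering with the lowest test cost) can be sketched roughly as below. This is only an illustration under assumptions, not the code in the patch: the class name BestClusteringSketch and all method names are made up, points are plain double[] arrays rather than Mahout vectors or weighted centroids, and a plain Lloyd's k-means stands in for BallKMeans.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: pick the best of several clusterings by held-out cost. */
public class BestClusteringSketch {

  /** Squared Euclidean distance between two points of equal dimension. */
  public static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Index of the center closest to p. */
  public static int nearest(double[] p, List<double[]> centers) {
    int best = 0;
    for (int c = 1; c < centers.size(); c++) {
      if (squaredDistance(p, centers.get(c)) < squaredDistance(p, centers.get(best))) {
        best = c;
      }
    }
    return best;
  }

  /** Plain Lloyd's k-means with random seeding; a stand-in for BallKMeans. */
  public static List<double[]> kMeans(List<double[]> points, int k, int iterations, Random rng) {
    if (points.size() < k) {
      throw new IllegalArgumentException("need at least k points");
    }
    List<double[]> shuffled = new ArrayList<>(points);
    Collections.shuffle(shuffled, rng);
    List<double[]> centers = new ArrayList<>();
    for (int i = 0; i < k; i++) {
      centers.add(shuffled.get(i).clone());
    }
    int dim = points.get(0).length;
    for (int it = 0; it < iterations; it++) {
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];
      for (double[] p : points) {
        int c = nearest(p, centers);
        counts[c]++;
        for (int d = 0; d < dim; d++) {
          sums[c][d] += p[d];
        }
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) {
          continue; // an empty cluster keeps its previous center
        }
        for (int d = 0; d < dim; d++) {
          sums[c][d] /= counts[c];
        }
        centers.set(c, sums[c]);
      }
    }
    return centers;
  }

  /** Sum of squared distances from each point to its nearest center. */
  public static double totalCost(List<double[]> points, List<double[]> centers) {
    double cost = 0;
    for (double[] p : points) {
      cost += squaredDistance(p, centers.get(nearest(p, centers)));
    }
    return cost;
  }

  /** Runs k-means numRuns times on fresh random splits; returns the centers with the lowest test cost. */
  public static List<double[]> bestOfRuns(List<double[]> points, int k, int numRuns,
                                          double testFraction, long seed) {
    Random rng = new Random(seed);
    List<double[]> best = null;
    double bestCost = Double.POSITIVE_INFINITY;
    for (int run = 0; run < numRuns; run++) {
      List<double[]> shuffled = new ArrayList<>(points);
      Collections.shuffle(shuffled, rng);
      int testSize = (int) (shuffled.size() * testFraction);
      List<double[]> test = shuffled.subList(0, testSize);
      List<double[]> train = shuffled.subList(testSize, shuffled.size());
      List<double[]> centers = kMeans(train, k, 10, rng); // fit on the training part only
      double cost = totalCost(test, centers);             // score on the held-out part
      if (cost < bestCost) {
        bestCost = cost;
        best = centers;
      }
    }
    return best;
  }
}
```

Calling BestClusteringSketch.bestOfRuns(points, k, numRuns, 0.25, seed) returns k centers. Scoring on the held-out part rather than on the training part is what lets the cost say something about whether the centers describe the overall distribution, which is the intent stated in the description.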
