Guys, could anyone please have a look at this patch? I'd really like to
merge it. :)


On Sat, May 11, 2013 at 10:03 AM, Dan Filimon
<[email protected]> wrote:

>    This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10193/
>   Review request for mahout, Ted Dunning, Jake Mannix, Sebastian
> Schelter, Suneel Marthi, and Robin Anil.
> By Dan Filimon.
>
> *Updated May 11, 2013, 7:03 a.m.*
> Changes
>
> Ping, please have a look at the map/reduce classes.
>
>   Description
>
> This depends (loosely) on https://reviews.apache.org/r/10194/
>
> This patch implements the MapReduce version of StreamingKMeans for 
> MAHOUT-1154.
>
> It adds 5 new classes:
> - CentroidWritable: class representing a centroid that can be written to a 
> SeqFile
> - StreamingKMeansDriver: class extending AbstractJob that is the entry 
> point to the MapReduce job
> - StreamingKMeansMapper: mapper, running StreamingKMeans (see MAHOUT-1162) 
> clustering the points one by one
> - StreamingKMeansReducer: reducer, running BallKMeans (see MAHOUT-1162) a 
> number of times and picking the clustering with the lowest total clustering 
> cost.
> The cost is determined by randomly splitting the incoming centroids into a 
> "training" and "test" set, computing the centroids on the training set and 
> the cost on the test set. The intent is to see whether the centroids actually 
> describe the distribution of the points or not.
> - StreamingKMeansUtilsMR: helper class with a method to instantiate a searcher 
> from a Configuration.
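
Inline note on the reducer's selection step described above: a minimal, plain-Java sketch of "cluster several times on a training split, score each run on a held-out test split, keep the cheapest". This is not the code from the patch. Plain Lloyd's k-means stands in for BallKMeans, centroid weights are ignored, the split is assumed to be made once and reused across runs, and every name (BestOfRunsSketch, bestOfRuns, testFraction, ...) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only: plain Lloyd's k-means stands in for BallKMeans,
// and none of these names come from the patch.
class BestOfRunsSketch {

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Index of the center closest to p (first one wins on ties).
  static int nearest(double[] p, List<double[]> centers) {
    int best = 0;
    for (int c = 1; c < centers.size(); c++) {
      if (squaredDistance(p, centers.get(c)) < squaredDistance(p, centers.get(best))) {
        best = c;
      }
    }
    return best;
  }

  // Lloyd's k-means seeded with k random points; requires points.size() >= k.
  static List<double[]> kMeans(List<double[]> points, int k, int iterations, Random rng) {
    List<double[]> shuffled = new ArrayList<>(points);
    Collections.shuffle(shuffled, rng);
    List<double[]> centers = new ArrayList<>();
    for (int i = 0; i < k; i++) {
      centers.add(shuffled.get(i).clone());
    }
    int dim = points.get(0).length;
    for (int it = 0; it < iterations; it++) {
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];
      for (double[] p : points) {
        int c = nearest(p, centers);
        counts[c]++;
        for (int j = 0; j < dim; j++) {
          sums[c][j] += p[j];
        }
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) {
          continue; // an empty cluster keeps its previous center
        }
        for (int j = 0; j < dim; j++) {
          sums[c][j] /= counts[c];
        }
        centers.set(c, sums[c]);
      }
    }
    return centers;
  }

  // Sum of squared distances from each point to its nearest center.
  static double totalCost(List<double[]> points, List<double[]> centers) {
    double cost = 0;
    for (double[] p : points) {
      cost += squaredDistance(p, centers.get(nearest(p, centers)));
    }
    return cost;
  }

  // Split once into test/train, cluster the training part `runs` times,
  // and return the centers with the lowest cost on the test part.
  static List<double[]> bestOfRuns(List<double[]> points, int k, int runs,
                                   double testFraction, Random rng) {
    List<double[]> shuffled = new ArrayList<>(points);
    Collections.shuffle(shuffled, rng);
    int testSize = (int) (testFraction * shuffled.size());
    List<double[]> test = shuffled.subList(0, testSize);
    List<double[]> train = shuffled.subList(testSize, shuffled.size());

    List<double[]> best = null;
    double bestCost = Double.POSITIVE_INFINITY;
    for (int r = 0; r < runs; r++) {
      List<double[]> centers = kMeans(train, k, 10, rng);
      double cost = totalCost(test, centers);
      if (cost < bestCost) {
        bestCost = cost;
        best = centers;
      }
    }
    return best;
  }
}
```

Scoring on points the run never saw is what lets the cost say something about the underlying distribution rather than about how well the centers fit their own training points.
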
>
> Additionally, there is a test class StreamingKMeansTestMR that uses MRUnit 
> to test the mapper alone, the reducer alone, and the two together.
>
> !!!
> The tests need MRUnit, so the core pom.xml file has to add it as a 
> dependency. We depend on the 1.0 snapshot, which is not yet released (it will 
> be very soon), hence the updated pom.xml is not provided for now.
> !!!
>
>   Testing
>
> See StreamingKMeansTestMR for the tests. These are all performed on data 
> sampled from a "hypercube" distribution (there is a multinormal distribution 
> centered at each vertex of the cube).
> Additionally there are ongoing tests on the 20 newsgroups data set (and some 
> more are on the way).
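
Inline note on the "hypercube" test data mentioned above: a rough sketch of how such a sample could be generated, for readers who want to reproduce something similar. It is not the generator used in StreamingKMeansTestMR. The unit cube, the isotropic Gaussian with a single shared standard deviation, and all names (HypercubeSampleSketch, sample, pointsPerVertex, stdDev) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative only: samples points around every vertex of the unit
// hypercube; names and parameters are hypothetical, not from the patch.
class HypercubeSampleSketch {

  static List<double[]> sample(int dimensions, int pointsPerVertex,
                               double stdDev, Random rng) {
    List<double[]> points = new ArrayList<>();
    int vertices = 1 << dimensions; // 2^dimensions corners
    for (int v = 0; v < vertices; v++) {
      for (int n = 0; n < pointsPerVertex; n++) {
        double[] p = new double[dimensions];
        for (int d = 0; d < dimensions; d++) {
          // Bit d of v selects coordinate 0 or 1 for this corner.
          double corner = (v >> d) & 1;
          p[d] = corner + stdDev * rng.nextGaussian();
        }
        points.add(p);
      }
    }
    return points;
  }
}
```

With a small stdDev the 2^d clusters are well separated, so a test can check that a clustering run with k = 2^d recovers one center near each vertex.
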
>
>   Diffs
>
>    - core/src/main/java/org/apache/mahout/clustering/ClusteringUtils.java
>    (PRE-CREATION)
>    - core/src/main/java/org/apache/mahout/clustering/streaming/cluster/BallKMeans.java
>    (PRE-CREATION)
>    - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/CentroidWritable.java
>    (PRE-CREATION)
>    - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
>    (PRE-CREATION)
>    - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansMapper.java
>    (PRE-CREATION)
>    - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansReducer.java
>    (PRE-CREATION)
>    - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansThread.java
>    (PRE-CREATION)
>    - core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansUtilsMR.java
>    (PRE-CREATION)
>    - core/src/test/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansTestMR.java
>    (PRE-CREATION)
>    - src/conf/driver.classes.default.props (ac45eef)
>
> View Diff <https://reviews.apache.org/r/10193/diff/>
>
