[ https://issues.apache.org/jira/browse/MAHOUT-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Schelter resolved MAHOUT-1154. ---------------------------------------- Resolution: Fixed > Implementing Streaming KMeans > ----------------------------- > > Key: MAHOUT-1154 > URL: https://issues.apache.org/jira/browse/MAHOUT-1154 > Project: Mahout > Issue Type: New Feature > Components: Clustering > Affects Versions: 0.8 > Reporter: Dan Filimon > Assignee: Dan Filimon > Fix For: 0.8 > > > An implementation of Streaming KMeans as mentioned in [1] is available here > [2]. > [1]http://mail-archives.apache.org/mod_mbox/mahout-dev/201303.mbox/%3ccaowb3goyf9zufrgxhsucpkjxk6cw0nnr8gwg__jsey+kvab...@mail.gmail.com%3E > [2] https://github.com/dfilimon/mahout > Since there will be more than one patches, there will be specific JIRA issues > that address each one. > The description of the code being added is: > The main classes are in o.a.m.clustering.streaming [1], under the > core/ project. These are subdivided into 2 packages: > - cluster: contains the BallKMeans and StreamingKMeans classes that > can be used standalone. > BallKMeans is exactly what it sounds like (uses k-means++ for the > initialization, then does a normal k-means pass and ignoring > outilers). > StreamingKMeans implements the online clustering that doesn't return > exactly k clusters, (it returns an estimate). This is used to > approximate the data. > - mapreduce: contains the CentroidWritable, StreamingKMeansDriver, > StreamingKMeansMapper and StreamingKMeansReducer classes. > CentroidWritable serializes Centroids (sort of like AbstractCluster). > StreamingKMeansDriver provides the driver for the job. > StreamingKMeansMapper runs StreamingKMeans in the mappers to produce > sketches of the data for the reducer. > StreamingKMeansReducer collects the centroids produced by the > mappers into one set of weighted points and runs BallKMeans on them > producing the final results. > Additionally the searchers are in o.a.m.math.neighborhood > - neighborhood: various searcher classes that implement nearest-neighbor > search using different strategies. > Searcher, UpdatableSearcher: abstract classes that define how to > search through collections of vectors. > BruteSearch: does a brute search (looks at every point...) > ProjectionSearch: uses random projections for searching. > FastProjectionSearch: also uses random projections (but not binary > search trees as in ProjectionSearch). > HashedVector, LocalitySensitiveHashSearch: implement locality > sensitive hash search. > All the tools that I used are in o.a.m.clustering.streaming [2], under > the examples/ project. > There are a bunch of classes here, covering everything from > vectorizing 20 newsgroups data to various IO utils. The more important > ones are: > utils.ExperimentUtils: convenience methods. > tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths. > [3] > https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming > [4] > https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming > The relevant issues are: > - MAHOUT-1155 (Centroid, WeightedVector) > - MAHOUT-1156 (searchers) > - MAHOUT-1162 (clustering, non map-reduce) > - MAHOUT-1181 (map-reduce, command-line changes, pom.xml) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira