That isn't streaming k-means in the Mahout sense. What they have done is implement a very basic sort of exponential smoothing on top of the normal k-means update so that only recent points contribute significantly to centroid locations. This assumes a high-quality initial clustering and probably also depends on the underlying data distribution changing only slowly. It doesn't solve the multi-start problem in high dimensions.
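That forgetful update can be sketched roughly like this (a toy illustration; the function name and the `decay` parameter are my own, not Spark's actual API):

```python
import math

def decayed_update(centroid, weight, batch_points, decay=0.9):
    """Fold one mini-batch of points into a centroid.

    The old centroid's accumulated weight is discounted by `decay`
    before the batch is mixed in, so older batches contribute
    exponentially less to the centroid location over time.
    """
    m = len(batch_points)
    if m == 0:
        return centroid, weight * decay
    batch_mean = [sum(xs) / m for xs in zip(*batch_points)]
    w_old = weight * decay
    w_new = w_old + m
    new_centroid = [
        (w_old * c + m * b) / w_new
        for c, b in zip(centroid, batch_mean)
    ]
    return new_centroid, w_new
```

With decay = 1 this reduces to the ordinary running-mean update; with decay < 1 the centroid tracks a drifting distribution, which is exactly why it needs a good starting clustering.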
The Mahout algorithm is a bit different. The idea is to do a single-pass, high-quality clustering of a lot of data. This is hard with traditional k-means, both because k-means normally requires multiple passes through the data to get good centroids and because multiple restarts are required to get good results. A streaming solution should also be able to give you an accurate clustering at any point in time at roughly unit-ish cost. All of these problems are solved by the Mahout solution. The current problem with the Mahout solution is that the map-reduce implementation has poor scaling properties due to the non-trivial size of the cluster sketches.

On Thu, Jan 29, 2015 at 7:24 AM, Gianmarco De Francisci Morales <
[email protected]> wrote:

> Seems they started to play with streaming algorithms also in Spark and
> MLlib.
>
> https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
>
> I wonder how much the mini-batch programming model they have fits
> traditional streaming algorithms.
> Also, I guess the concept of state across the stream does not fit very well
> the abstraction of RDDs.
>
> Interesting to read nevertheless.
>
> Cheers,
> --
> Gianmarco
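P.S. For anyone unfamiliar with the sketch idea mentioned above, here is a toy one-pass sketch in Python. This is my own simplification, not Mahout's actual StreamingKMeans: the real implementation merges probabilistically, adapts the distance cutoff differently, and keeps on the order of k log n centroids before a final ball k-means pass over the sketch.

```python
import math

def streaming_sketch(points, distance_cutoff=1.0, growth=1.3, max_centroids=20):
    """One pass over `points`, keeping a small weighted sketch of centroids.

    Each point either merges into a nearby provisional centroid or
    starts a new one; when the sketch grows too large, the distance
    cutoff is raised and nearby centroids are collapsed, so the sketch
    stays compact while a pass is still in progress.
    """
    centroids = []  # list of (centroid tuple, weight)
    for p in points:
        if not centroids:
            centroids.append((tuple(p), 1))
            continue
        # find nearest existing centroid
        d, i = min(
            (math.dist(p, c), i) for i, (c, _) in enumerate(centroids)
        )
        if d < distance_cutoff:
            c, w = centroids[i]
            merged = tuple((w * ci + pi) / (w + 1) for ci, pi in zip(c, p))
            centroids[i] = (merged, w + 1)
        else:
            centroids.append((tuple(p), 1))
        if len(centroids) > max_centroids:
            distance_cutoff *= growth
            centroids = _collapse(centroids, distance_cutoff)
    return centroids

def _collapse(centroids, cutoff):
    """Greedily merge centroids closer than `cutoff` to shrink the sketch."""
    out = []
    for c, w in centroids:
        for j, (c2, w2) in enumerate(out):
            if math.dist(c, c2) < cutoff:
                merged = tuple(
                    (w * a + w2 * b) / (w + w2) for a, b in zip(c, c2)
                )
                out[j] = (merged, w + w2)
                break
        else:
            out.append((c, w))
    return out
```

Because the sketch carries weights, you can recluster it down to k clusters at any point during the pass, which is what gives the anytime, roughly constant-cost property. It is also why the sketches are non-trivially large, which is the map-reduce scaling problem mentioned above.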
