There is a lot of work in stream clustering from two communities, the streaming algorithmic community, and the machine learning/data stream mining community.
There is a nice survey of these new methods in: https://www.researchgate.net/publication/257132178_Data_Stream_Clustering_A_Survey http://dl.acm.org/citation.cfm?id=2522981 In Apache SAMOA, CluStream is implemented as one of the first state-of-the-art methods: http://www.vldb.org/conf/2003/papers/S04P02.pdf and it could be nice to have more methods implemented, as for example the ones implemented in MOA. On Mon, Feb 2, 2015 at 7:07 AM, Ted Dunning <[email protected]> wrote: > That isn't streaming k-means in the Mahout sense. What they have done is > implement a very basic sort of exponential smoothing to the normal k-means > algorithm so that only recent points contribute significantly to centroid > location. This assumes an initial high quality cluster and probably also > depends on small changes in the underlying data distribution. It doesn't > solve the multi-start problem in high dimensions. > > The Mahout algorithm is a bit different. The idea is that you want to do a > single pass high quality clustering of a lot of data. This is hard to do > with traditional k-means, both because k-means normally requires multiple > passes through the data to get good centroids and also because multiple > restarts are required to get good results. A streaming solution should > also be able to give you an accurate clustering at any point in time with > roughly unit-ish cost. All these problems are solved with the Mahout > solution. The current problems with the Mahout solution have to do with > the fact that the map-reduce solution has poor scaling properties due to > the non-trivial size of the cluster sketches. > > > > > > On Thu, Jan 29, 2015 at 7:24 AM, Gianmarco De Francisci Morales < > [email protected]> wrote: > > > Seems they started to play with streaming algorithms also in Spark and > > MLlib. > > > > > https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html > > > > I wonder how much the mini-batch programming model they have fits > > traditional streaming algorithms. > > Also, I guess the concept of state across the stream does not fit very > well > > the abstraction of RDDs. > > > > Interesting to read nevertheless. > > > > Cheers, > > -- > > Gianmarco > > >
