Dmitriy is correct in that the Streaming KMeans in MlLib is a wrong name for something that was meant to convey "Spark Streaming + KMeans".
The Mahout Streaming KMeans is an implementation of the Meyerson paper that's been referred to in Dmitriy's email. I have had folks wrongly misconstrue Streaming KMeans as being Spark Streaming + KMeans, thanks to the bad naming on part of the MlLib folks. I had spoken to Jeremy Freeman, the MLLib Streaming KMEans contributor about this and he agrees that the intent was to convery "Spark Streaming + KMeans", he definitely wasn't aware of the Streaming KMeans algorithm that existed much before Spark Streaming. On Tue, Jun 16, 2015 at 5:34 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote: > "streaming k-means" is something else afaik. Streaming k-means is reserved > for a particular k-means method (in Mahout, at least, [1]). > > Whereas as far as i understand what mllib calls "streaming k-means" is name > given by mllib contributor which really means "online k-means", i.e. radar > tracking of centroids over time over stream of (x_i, t_i) pairs and that > uses Spark streaming, but has nothing to do with Shindler et. al. method. > > At least that was our understanding last time we looked at the issue of the > names here. > > So this issue has come several times already when people come and say, > "what? streaming k-means? mllib has streaming k-means" whereas everybody > else is talking about something else. > > [1] > > http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf > > On Tue, Jun 16, 2015 at 2:04 PM, RJ Nowling <rnowl...@gmail.com> wrote: > > > There is a streaming k-means implementation in MLlib that uses reservoir > > sampling. > > > > On Tue, Jun 2, 2015 at 2:24 AM, Marko Dinic <marko.di...@nissatech.com> > > wrote: > > > > > Ted, > > > > > > Thank you for your answer. > > > > > > What would you then recommend me to do? My idea is to implement it to > > > enable clustering of time series using DTW (Dynamic Time Warping) as > > > distance measure. As you know, the main problem is that K-medoids is > not > > > scalable, so that's standing on my way. Of course, it could be used > with > > > other distances as well. > > > > > > I have already implemented something that I consider a scalable > > K-medoids, > > > based on using pivots to speed up medoid selection ( > > > https://seer.lcc.ufmg.br/index.php/jidm/article/viewFile/99/82). This > > > works for distance measures such as Euclidean, has some limitations > (best > > > results are in case of normal distribution, outliers could be a > problem), > > > but it works pretty good (considering the computations). The thing is, > it > > > can't be used with DTW, since it relies on projections, while triangle > > > inequality for DTW does not hold. That is why I'm considering this > > > Streaming approach now. > > > > > > Would you think that it is worthy of giving a shot? I'm really > stretching > > > for a scalable solution. > > > > > > Best regards, > > > Marko > > > > > > > > > On Tue 02 Jun 2015 12:03:40 AM CEST, Ted Dunning wrote: > > > > > >> The streaming k-means works by building a sketch of the data which is > > then > > >> used to do real clustering. > > >> > > >> It might be that this sketch would be acceptable to do k-medoids, but > > that > > >> is definitely not guaranteed. > > >> > > >> Similarly, it might be possible to build a medoid sketch instead of a > > mean > > >> based sketch, but this is also unexplored ground. > > >> > > >> The virtue of the first approach (using a m-means sketch as input to > > >> k-medoids) would be that it would make the k-medoids scalable. > > >> > > >> > > >> > > >> On Mon, Jun 1, 2015 at 1:04 PM, Marko Dinic < > marko.di...@nissatech.com> > > >> wrote: > > >> > > >> Hello everyone, > > >>> > > >>> I have an idea and I would like to get a validation from community > > about > > >>> it. > > >>> > > >>> In Mahout there is an implementation of Streaming K-means. I'm > > interested > > >>> in your opinion would it make sense to make a similar implementation > of > > >>> Streaming K-medoids? > > >>> > > >>> K-medoids has even bigger problems than K-means because it's not > > >>> scalable, > > >>> but can be useful in some cases (e.g. It allows more sophisticated > > >>> distance > > >>> measures). > > >>> > > >>> What is your opinion about implementation of this? > > >>> > > >>> Best regards, > > >>> Marko > > >>> > > >>> > > >> > > >