Isn't streaming k-means just a different approach to crunching through the
data? In other words, shouldn't the result of streaming k-means be comparable
to running k-means over multiple chained MapReduce cycles?
I just read a paper about k-means clustering and its underlying algorithm.
Excellent. From Ellen's description, the first Music use case may be an
implicit-preference-based recommender using synthetic data? I'm quickly
discovering how flexible Solr is in many of these cases.
Here's another use you may have thought of:
Shopping cart recommenders, as the intuition goes,
The way that the new streaming k-means works is that there is a first
sketch pass which only requires an upper bound on the final number of
clusters you will want. It adaptively creates more or fewer clusters
depending on the data and your bound. This sketch is guaranteed to be
computed within a single pass over the data, and it contains at most
about k log N weighted centroids, where k is your bound and N is the
number of points.
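
To make the mechanics concrete, here is a minimal, illustrative Python
sketch of that first pass. This is not Mahout's implementation; the names
(streaming_sketch, k_bound, beta, the initial cutoff) and the exact growth
rule are my own assumptions. The essential idea is there, though: points
far from every existing centroid tend to open new centroids, nearby points
merge into weighted centroids, and when the sketch grows past roughly
k log n centroids it is collapsed against a larger distance cutoff.

import math
import random

def streaming_sketch(points, k_bound, beta=1.3, seed=42):
    """One-pass sketch: returns a list of (centroid, weight) pairs.

    k_bound is only an upper bound on the number of clusters you will
    eventually want; the sketch adaptively keeps on the order of
    k_bound * log(n) weighted centroids.
    """
    rng = random.Random(seed)
    centroids = []     # list of (vector, weight) pairs
    cutoff = 1e-6      # distance cutoff; grows as the sketch saturates

    def nearest(p):
        best_i, best_d = -1, float("inf")
        for i, (c, _) in enumerate(centroids):
            d = math.dist(p, c)
            if d < best_d:
                best_i, best_d = i, d
        return best_i, best_d

    def add(p, w):
        if not centroids:
            centroids.append((list(p), w))
            return
        i, d = nearest(p)
        # Far-away points tend to open new centroids; nearby points are
        # folded into their nearest neighbour as a weighted average.
        if rng.random() < min(d / cutoff, 1.0):
            centroids.append((list(p), w))
        else:
            c, cw = centroids[i]
            merged = [(ci * cw + pi * w) / (cw + w) for ci, pi in zip(c, p)]
            centroids[i] = (merged, cw + w)

    n_seen = 0
    for p in points:
        n_seen += 1
        add(list(p), 1.0)
        # Allow roughly k_bound * log(n) centroids before collapsing.
        limit = int(k_bound * max(1.0, math.log2(n_seen + 1)))
        while len(centroids) > limit:
            cutoff *= beta                  # raise the cutoff ...
            old, centroids = centroids, []  # ... and re-sketch the centroids
            for c, w in old:
                add(c, w)

    return centroids

if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(rng.gauss(cx, 0.1), rng.gauss(cy, 0.1))
           for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(1000)]
    sketch = streaming_sketch(pts, k_bound=10)
    print(len(sketch), "weighted centroids")  # far fewer than 3,000 points

The final pass then runs an ordinary weighted k-means over these few
centroids instead of over the raw data, which is the "final clustering"
discussed below.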
Thanks for your detailed answer.
So if the upper bound on the final number of clusters is unknown at the
beginning, what would happen if I chose a very high number that is
guaranteed to be above the estimated number of clusters?
For example, what if I set it to 10,000 clusters when an estimate of 5,000 is
more realistic?
Yes. That will work.
The sketch will then contain 10,000 x log N centroids. If N = 10^9, then
log N ≈ 30, so the sketch will have about 300,000 weighted centroids in
it. The final clustering will have to process these centroids to produce
the desired 5,000 clusters. Since 300,000 is a relatively small number,
that final clustering can easily be done in memory.
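
To sanity-check that arithmetic (the log here is evidently base 2), a quick
computation with illustrative names:

import math

k_bound = 10_000                      # user-supplied upper bound on clusters
N = 10**9                             # number of input points
print(round(k_bound * math.log2(N)))  # 298974, i.e. roughly 300,000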