Re: What are the best settings for my clustering task

2013-10-02 Thread Jens Bonerz
Isn't the streaming k-means just a different approach to crunch through the data? In other words, the result of streaming k-means should be comparable to using k-means in multiple chained map reduce cycles? I just read a paper about the k-means clustering and its underlying algorithm. According

Re: Solr-recommender

2013-10-02 Thread Pat Ferrel
Excellent. From Ellen's description the first Music use may be an implicit preference based recommender using synthetic data? I'm quickly discovering how flexible Solr use is in many of these cases. Here's another use you may have thought of: Shopping cart recommenders, as goes the intuition,

Re: What are the best settings for my clustering task

2013-10-02 Thread Ted Dunning
The way that the new streaming k-means works is that there is a first sketch pass which only requires an upper bound on the final number of clusters you will want. It adaptively creates more or fewer clusters depending on the data and your bound. This sketch is guaranteed to be computed within at
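
The adaptive sketch pass described above can be illustrated with a toy version. This is a minimal sketch under stated assumptions, not Mahout's implementation: it works on 1-D points, the names `absorb` and `streaming_sketch` are hypothetical, the starting cutoff and the growth factor 1.5 are arbitrary choices, and `max_centroids` (assumed to be at least 1) stands in for the cap on sketch size derived from the user's upper bound.

```python
import random

def absorb(centroids, x, w, cutoff, rng):
    """Add weighted point (x, w): merge into the nearest centroid or open a new one."""
    if centroids:
        nearest = min(centroids, key=lambda c: abs(c[0] - x))
        dist = abs(nearest[0] - x)
        # Far-away (or heavy) points are more likely to start a new centroid.
        if rng.random() >= w * dist / cutoff:
            total = nearest[1] + w
            nearest[0] += (x - nearest[0]) * w / total  # weighted mean update
            nearest[1] = total
            return
    centroids.append([x, w])

def streaming_sketch(points, max_centroids, seed=0):
    """Single pass over the data; returns a list of [mean, weight] centroids."""
    rng = random.Random(seed)
    centroids = []   # each entry is [mean, weight]
    cutoff = 1e-3    # hypothetical starting distance scale
    for x in points:
        absorb(centroids, x, 1.0, cutoff, rng)
        # Too many centroids: loosen the cutoff and re-cluster the sketch itself.
        while len(centroids) > max_centroids:
            cutoff *= 1.5
            old, centroids = centroids, []
            for mean, weight in old:
                absorb(centroids, mean, weight, cutoff, rng)
    return centroids

# Illustrative data: three well-separated groups of 200 points each.
data_rng = random.Random(42)
points = [data_rng.gauss(c, 0.1) for c in (0.0, 5.0, 10.0) for _ in range(200)]
sketch = streaming_sketch(points, max_centroids=20)
print(len(sketch), sum(w for _, w in sketch))
```

The number of centroids is not fixed in advance; it only stays at or below the cap, and the centroid weights always sum to the number of points seen.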

Re: What are the best settings for my clustering task

2013-10-02 Thread Jens Bonerz
Thanks for your detailed answer. So if the upper bound on the final number of clusters is unknown in the beginning, what would happen if I define a very high number that is guaranteed to be above the estimated number of clusters? For example, if I set it to 10,000 clusters if an estimate of 5,000 is

Re: What are the best settings for my clustering task

2013-10-02 Thread Ted Dunning
Yes. That will work. The sketch will then contain 10,000 x log N centroids. If N = 10^9, log N \approx 30 so the sketch will have about 300,000 weighted centroids in it. The final clustering will have to process these centroids to produce the desired 5,000 clusters. Since 300,000 is a
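
The arithmetic in this reply can be checked in a few lines. The base-2 logarithm is an assumption here; it is the base that matches the stated log N \approx 30 for N = 10^9. The function name `sketch_size` is hypothetical.

```python
import math

def sketch_size(k_upper, n_points):
    """Estimated number of weighted centroids in the sketch: k * log2(N)."""
    return k_upper * math.log2(n_points)

k_upper = 10_000      # upper bound on the final number of clusters
n_points = 10**9      # N, the number of data points

log_n = math.log2(n_points)            # a little under 30
size = sketch_size(k_upper, n_points)  # a little under 300,000

print(f"log2(N) = {log_n:.1f}")
print(f"sketch size = {size:,.0f} centroids")
print(f"reduction vs. raw data = {n_points / size:,.0f}x")
```

Under these assumptions the final clustering step sees a few hundred thousand weighted centroids instead of a billion raw points.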