Yes, I looked at this one. The picture shows the general idea, but only at a very high level. I'm more interested in the implementation.

Initial clusters are modified in ClusterIterator.iterateMR.

CIMapper setup - Does it load a regular HDFS file, or is it a distributed cache, and why? It looks as if it's a regular file. Isn't it a huge overhead to load this file at setup? What is a reasonable size for this file?

Why is everything written during the cleanup call? Is it used instead of a combiner?

General idea - During each iteration the output of the previous iteration (reduce) is loaded at setup, then the model is updated and propagated (at cleanup) to the reducers. What happens if one cluster is very large, ~70% of all elements in the data set? Does cleanup help with that?

Stop condition - isConverged - Does it compare the outputs (2 files) of the last two iterations, or is it encapsulated in the Cluster class?
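
To make the question concrete, here is how I currently picture the setup / map / cleanup flow, as a minimal plain-Java sketch. It does not use Hadoop or Mahout: SketchMapper, Partial and reduce are names I made up, points are one-dimensional, and the whole thing is my assumption about the pattern, not the actual CIMapper code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InMapperCombiningSketch {

    /** Partial statistics for one cluster: running sum and point count. */
    public static final class Partial {
        public final int clusterId;
        public final double sum;
        public final long count;

        public Partial(int clusterId, double sum, long count) {
            this.clusterId = clusterId;
            this.sum = sum;
            this.count = count;
        }
    }

    /** Hypothetical stand-in for a mapper; not Mahout's CIMapper. */
    public static final class SketchMapper {
        private double[] centers;
        private double[] sums;
        private long[] counts;

        /** setup: load the centers produced by the previous iteration. */
        public void setup(double[] priorCenters) {
            centers = priorCenters.clone();
            sums = new double[centers.length];
            counts = new long[centers.length];
        }

        /** map: assign the point to its nearest center; emit nothing. */
        public void map(double point) {
            int best = 0;
            for (int i = 1; i < centers.length; i++) {
                if (Math.abs(point - centers[i]) < Math.abs(point - centers[best])) {
                    best = i;
                }
            }
            sums[best] += point;
            counts[best]++;
        }

        /** cleanup: emit at most one record per cluster for this mapper. */
        public List<Partial> cleanup() {
            List<Partial> out = new ArrayList<>();
            for (int i = 0; i < centers.length; i++) {
                if (counts[i] > 0) {
                    out.add(new Partial(i, sums[i], counts[i]));
                }
            }
            return out;
        }
    }

    /** reduce: merge the partials from all mappers into new centers. */
    public static double[] reduce(double[] priorCenters, List<Partial> partials) {
        double[] sums = new double[priorCenters.length];
        long[] counts = new long[priorCenters.length];
        for (Partial p : partials) {
            sums[p.clusterId] += p.sum;
            counts[p.clusterId] += p.count;
        }
        double[] next = priorCenters.clone(); // empty clusters keep their old center
        for (int i = 0; i < next.length; i++) {
            if (counts[i] > 0) {
                next[i] = sums[i] / counts[i];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        double[] prior = {0.0, 10.0};

        SketchMapper a = new SketchMapper();
        a.setup(prior);
        for (double p : new double[] {1.0, 2.0, 9.0}) a.map(p);

        SketchMapper b = new SketchMapper();
        b.setup(prior);
        for (double p : new double[] {11.0, 3.0}) b.map(p);

        List<Partial> shuffled = new ArrayList<>(a.cleanup());
        shuffled.addAll(b.cleanup());
        System.out.println(Arrays.toString(reduce(prior, shuffled)));
    }
}
```

If this picture is right, each mapper sends at most k records to the reducers no matter how many points it reads, so a cluster holding ~70% of the data would still cost only one record per mapper. Is that the reason for writing everything at cleanup?
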

On Sun, Apr 13, 2014 at 4:32 PM, Sebastian Schelter <s...@apache.org> wrote:

> Did you check the website at
> https://mahout.apache.org/users/clustering/k-means-clustering.html ?
>
> On 04/13/2014 02:53 PM, Maciej Mazur wrote:
>
>> Recently I've been looking into K-means implementation.
>> I want to understand how it works, and why it was designed this way.
>> Could you give me some overview?
>> I see that during the setup clusters are read from the file. Is it a
>> distributed cache? What's the maximal size of this file, what's the
>> maximum value of k?
>> Nothing is output during the call of the map function; everything is
>> saved at cleanup. Why?
>> Are there any docs concerning implementation?
>>
>> Thanks,
>> Maciej
>>
>> On Wed, Apr 9, 2014 at 7:23 AM, Ted Dunning <ted.dunn...@gmail.com>
>> wrote:
>>
>>> Well, you could view this as a performance bug in the implementation of
>>> the linear algebra.
>>>
>>> It certainly is, however, an odd interpretation of transpose. I have used
>>> a similar trick in R to use sparse matrices as a counter but it always
>>> worried me a bit.
>>>
>>> Sent from my iPhone
>>>
>>> On Apr 8, 2014, at 17:49, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>
>>>> Problem is, I want to use linear algebra to handle that, not
>>>> combine().