Re: Naïve k-means using hadoop

2013-03-27 Thread Mark Miller
On Mar 27, 2013, at 12:47 PM, Ted Dunning wrote: > And, of course, due credit should be given here. The advanced clustering algorithms in Crunch were lifted from the new stuff in Mahout pretty much step for step. The Mahout group would have loved to have contributions from the Cloudera

Re: Naïve k-means using hadoop

2013-03-27 Thread Ted Dunning
Spark would be an excellent choice for the iterative sort of k-means. It could be good for sketch-based algorithms as well, but the difference would be much less pronounced. On Wed, Mar 27, 2013 at 3:39 PM, Charles Earl wrote: > I would think also that starting with centers in some in-memory Hadoop platform like Spark would also be a valid approach.
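For contrast, the sketch-based idea mentioned here can be illustrated in a few lines of plain Python: a single pass over the data builds a small weighted summary, so there is little iteration for an in-memory platform to speed up. (This toy is only meant to show the shape of the idea; it is not Mahout's streaming k-means, and the distance threshold is arbitrary.)

```python
import math

def one_pass_sketch(points, threshold=1.0):
    # summary holds (centroid, weight) pairs. A point folds into the nearest
    # centroid when it is close enough; otherwise it opens a new centroid.
    summary = []
    for p in points:
        if summary:
            i = min(range(len(summary)), key=lambda i: math.dist(summary[i][0], p))
            c, w = summary[i]
            if math.dist(c, p) <= threshold:
                # Weighted mean of the old centroid and the new point.
                summary[i] = (tuple((cj * w + pj) / (w + 1) for cj, pj in zip(c, p)),
                              w + 1)
                continue
        summary.append((p, 1))
    return summary

print(one_pass_sketch([(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (0.0, 0.1)]))
```

A second, cheap clustering of the few weighted summary points would then produce the final k centers; that last step is small enough to run anywhere.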

Re: Naïve k-means using hadoop

2013-03-27 Thread Ted Dunning
And, of course, due credit should be given here. The advanced clustering algorithms in Crunch were lifted from the new stuff in Mahout pretty much step for step. The Mahout group would have loved to have contributions from the Cloudera guys instead of re-implementation, but you can't legislate taste.

Re: Naïve k-means using hadoop

2013-03-27 Thread Charles Earl
I would think also that starting with centers in some in-memory Hadoop platform like Spark would also be a valid approach. I think the Spark demo assumes that the data set is cached, vs. just the centers. C On Mar 27, 2013, at 9:24 AM, Bertrand Dechoux wrote: > And there is also Cascading ;) : http://www.cascading.org/

Re: Naïve k-means using hadoop

2013-03-27 Thread Bertrand Dechoux
And there is also Cascading ;) : http://www.cascading.org/ But like Crunch, this is Hadoop. Both are 'only' higher-level APIs for MapReduce. As for the number of reducers, you will have to do the math yourself, but I highly doubt that more than one reducer is needed (imho). But you can indeed distribute

Re: Naïve k-means using hadoop

2013-03-27 Thread Yaron Gonen
Thanks! *Bertrand*: I don't like the idea of using a single reducer. A better way for me is to write all the output of all the reducers to the same directory, and then distribute all the files. I know about Mahout of course, but I want to implement it myself. I will look at the documentation though

Re: Naïve k-means using hadoop

2013-03-27 Thread Harsh J
If you're also a fan of doing things the better way, you can also check out some Apache Crunch (http://crunch.apache.org) ways of doing this via https://github.com/cloudera/ml (blog post: http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/). On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen wrote:

Re: Naïve k-means using hadoop

2013-03-27 Thread Bertrand Dechoux
Of course, you should check out Mahout, at least the documentation, even if you really want to implement it by yourself. https://cwiki.apache.org/MAHOUT/k-means-clustering.html Regards Bertrand On Wed, Mar 27, 2013 at 1:34 PM, Bertrand Dechoux wrote: > Actually for the first step, the client could create a file with the centers

Re: Naïve k-means using hadoop

2013-03-27 Thread Bertrand Dechoux
Actually for the first step, the client could create a file with the centers and then put it on HDFS and use it with the distributed cache. A single reducer might be enough, and in that case its only responsibility is to create the file with the updated centers. You can then use this new file again in the next iteration
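A sketch of that control flow in plain Python, with a local file standing in for the centers file on HDFS and the distributed cache (the paths, the step function, and the convergence threshold are all illustrative, not from the thread):

```python
import json
import math
import os
import tempfile

def lloyd_step(vectors, centers):
    # One full MapReduce job's worth of work: assign each vector to its
    # nearest center, then average each group into a new center.
    groups = {i: [] for i in range(len(centers))}
    for v in vectors:
        groups[min(groups, key=lambda i: math.dist(centers[i], v))].append(v)
    return [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in groups.items()]

def run_driver(vectors, centers, max_iter=20, eps=1e-6):
    workdir = tempfile.mkdtemp()
    for it in range(1, max_iter + 1):
        path = os.path.join(workdir, f"centers-{it}.json")
        with open(path, "w") as f:   # the single reducer writes the new file
            json.dump(centers, f)
        with open(path) as f:        # mappers read it back via the "cache"
            cached = [tuple(c) for c in json.load(f)]
        new_centers = lloyd_step(vectors, cached)
        # Stop when no center moved more than eps.
        if all(math.dist(a, b) <= eps for a, b in zip(centers, new_centers)):
            return new_centers, it
        centers = new_centers
    return centers, max_iter
```

For example, `run_driver([(0.0, 0.0), (0.2, 0.0), (4.0, 4.0), (4.2, 4.0)], [(0.0, 0.0), (1.0, 1.0)])` converges in two iterations; on Hadoop each loop body would instead launch a job, with the client deciding whether to go around again.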

Naïve k-means using hadoop

2013-03-27 Thread Yaron Gonen
Hi, I'd like to implement k-means by myself, in the following naive way. Given a large set of vectors:
1. Generate k random centers from the set.
2. Mapper reads all the centers and a split of the vector set, and emits for each vector the closest center as a key.
3. Reducer calculates the new centers
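The numbered steps above can be sketched as a local simulation of one map/reduce pass (plain Python, nothing Hadoop-specific; the helper names are made up):

```python
def closest(centers, v):
    # Mapper side: index of the nearest center by squared Euclidean distance.
    return min(range(len(centers)),
               key=lambda i: sum((c - x) ** 2 for c, x in zip(centers[i], v)))

def kmeans_pass(vectors, centers):
    # Map phase: emit (closest-center-index, vector) pairs, grouped by key.
    groups = {}
    for v in vectors:
        groups.setdefault(closest(centers, v), []).append(v)
    # Reduce phase: the reducer for each key averages its vectors.
    new_centers = list(centers)  # a center that attracted no vectors stays put
    for i, vs in groups.items():
        new_centers[i] = tuple(sum(col) / len(vs) for col in zip(*vs))
    return new_centers

vectors = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centers = [(0.0, 0.0), (5.0, 5.0)]   # step 1 would pick these at random
print(kmeans_pass(vectors, centers))
```

On Hadoop, the map phase would run once per input split and the reduce phase once per key; the loop over passes is driven from the client, as the replies above discuss.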