On Mar 27, 2013, at 12:47 PM, Ted Dunning wrote:
> And, of course, due credit should be given here. The advanced clustering
> algorithms in Crunch were lifted from the new stuff in Mahout pretty much
> step for step.
>
> The Mahout group would have loved to have contributions from the Cloudera
> guys instead of re-implementation, but you can't legislate taste.
Spark would be an excellent choice for the iterative sort of k-means.
It could be good for sketch-based algorithms as well, but the difference
would be much less pronounced.
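To illustrate the point about the iterative case, here is a minimal plain-Python sketch (not Spark code; the `kmeans` and `closest` names are mine) of Lloyd's iterations. Note the inner loop: every iteration re-scans the entire vector set, which is exactly the pass that MapReduce re-reads from HDFS and Spark keeps cached in memory.

```python
def closest(point, centers):
    # index of the nearest center by squared Euclidean distance
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centers[i])))

def kmeans(points, centers, iterations):
    for _ in range(iterations):          # every pass reads all points again
        dim = len(centers[0])
        sums = [[0.0] * dim for _ in centers]
        counts = [0] * len(centers)
        for p in points:                 # full scan of the data set
            i = closest(p, centers)
            counts[i] += 1
            sums[i] = [s + x for s, x in zip(sums[i], p)]
        new_centers = []
        for i in range(len(centers)):
            if counts[i]:
                new_centers.append([s / counts[i] for s in sums[i]])
            else:
                new_centers.append(centers[i])  # keep an empty center as-is
        centers = new_centers
    return centers
```

The sketch-based algorithms mentioned above only need one or two passes, which is why caching buys them much less.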
On Wed, Mar 27, 2013 at 3:39 PM, Charles Earl wrote:
> I would think also that starting with centers in some in-memory Hadoop
> platform like Spark would also be a valid approach.
And, of course, due credit should be given here. The advanced clustering
algorithms in Crunch were lifted from the new stuff in Mahout pretty much
step for step.
The Mahout group would have loved to have contributions from the Cloudera
guys instead of re-implementation, but you can't legislate taste.
I would think also that starting with centers in some in-memory Hadoop platform
like Spark would also be a valid approach.
I think the Spark demo assumes that the whole data set is cached, vs. just the
centers.
C
On Mar 27, 2013, at 9:24 AM, Bertrand Dechoux wrote:
> And there is also Cascading ;) : http://www.cascading.org/
And there is also Cascading ;) : http://www.cascading.org/
But like Crunch, this is Hadoop. Both are 'only' higher-level APIs for MapReduce.
As for the number of reducers, you will have to do the math yourself, but
I highly doubt that more than one reducer is needed (imho). But you can
indeed distribute
Thanks!
*Bertrand*: I don't like the idea of using a single reducer. A better way,
for me, is to write the output of all the reducers to the same directory,
and then distribute all the files.
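A sketch of that multi-reducer idea (the file layout and the `parse_part` / `merge_parts` names are hypothetical): since each center id is routed to exactly one reducer, each reducer can write its own part file of `center_id<TAB>coords` lines into a shared directory, and rebuilding the full center set is just reading every part file.

```python
def parse_part(lines):
    # one hypothetical part file: lines of "center_id<TAB>x,y,..."
    centers = {}
    for line in lines:
        center_id, coords = line.rstrip("\n").split("\t")
        centers[int(center_id)] = [float(x) for x in coords.split(",")]
    return centers

def merge_parts(parts):
    # each center id lands in exactly one reducer, so the parts are disjoint
    merged = {}
    for lines in parts:
        merged.update(parse_part(lines))
    return merged
```

This avoids the single-reducer bottleneck at the cost of having to ship several small files instead of one.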
I know about Mahout of course, but I want to implement it myself. I will
look at the documentation though
If you're also a fan of doing things the better way, you can also
check out some Apache Crunch (http://crunch.apache.org) ways of doing
this via https://github.com/cloudera/ml (blog post:
http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
On Wed, Mar 27, 2013 at 3:29 PM, Yaron wrote:
Of course, you should check out Mahout, at least the documentation, even if
you really want to implement it by yourself.
https://cwiki.apache.org/MAHOUT/k-means-clustering.html
Regards
Bertrand
On Wed, Mar 27, 2013 at 1:34 PM, Bertrand Dechoux wrote:
> Actually for the first step, the client could create a file with the
> centers and then put it on hdfs and use it with distributed cache.
Actually for the first step, the client could create a file with the
centers and then put it on hdfs and use it with distributed cache.
A single reducer might be enough, and in that case its only responsibility is
to create the file with the updated centers.
You can then use this new file again in the next iteration.
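The driver loop this describes can be sketched as follows. This is only the control flow: `run_job` is a hypothetical stand-in for "put the centers file on HDFS, submit the MapReduce job, read back the updated centers", and the convergence test is my addition, not something from the thread.

```python
def drive(run_job, centers, max_iterations, tolerance=1e-6):
    """Repeatedly run one MapReduce pass until the centers stop moving."""
    for _ in range(max_iterations):
        new_centers = run_job(centers)    # one full pass over the data
        # largest Euclidean distance any center moved this iteration
        shift = max(sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5
                    for c1, c2 in zip(centers, new_centers))
        centers = new_centers
        if shift < tolerance:             # converged: centers stopped moving
            break
    return centers
```

With the distributed-cache approach, each `run_job` call would ship the current centers file to every mapper before the pass starts.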
Hi,
I'd like to implement k-means by myself, in the following naive way:
Given a large set of vectors:
1. Generate k random centers from the set.
2. Mapper reads all centers and a split of the vector set, and emits for
each vector its closest center as the key.
3. Reducer calculates a new center as the mean of all vectors assigned to it.
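A minimal sketch of steps 2 and 3 as plain functions (Hadoop Streaming style, but without the stdin/stdout plumbing; `map_vector` and `reduce_center` are names I made up): the mapper keys each vector by its closest center, and the reducer averages the vectors that chose a given center.

```python
def map_vector(vector, centers):
    # step 2: emit (closest_center_id, vector)
    best = min(range(len(centers)),
               key=lambda i: sum((v - c) ** 2
                                 for v, c in zip(vector, centers[i])))
    return best, vector

def reduce_center(center_id, vectors):
    # step 3: the new center is the component-wise mean
    n = len(vectors)
    return center_id, [sum(col) / n for col in zip(*vectors)]
```

Emitting the raw vector as the map value works, but a common refinement is a combiner that emits (partial sum, count) per center so far less data crosses the shuffle.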