I agree that it will be good to see more algorithms added to the MLlib universe, although this does bring to mind a couple of comments:
- MLlib as Mahout.next would be unfortunate. There are some gems in
  Mahout, but there are also lots of rocks. Setting a minimal bar of
  working, correctly implemented, and documented requires a surprising
  amount of work.
- Not getting any signal out of your data with an algorithm like K-means
  implies one of the following: (1) there is no signal in your data,
  (2) you should try tuning the algorithm differently, (3) you're using
  K-means wrong, (4) you should try preparing the data differently,
  (5) all of the above, or (6) none of the above.

My $0.02.
-- Paul

p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen <so...@cloudera.com> wrote:

> Nobody asked me, and this is a comment on a broader question, not this
> one, but:
>
> In light of a number of recent items about adding more algorithms,
> I'll say that I personally think an explosion of algorithms should
> come after the MLlib "core" is more fully baked. I'm thinking of
> finishing out the changes to vectors and matrices, for example. Things
> are going to change significantly in the short term as people use the
> algorithms and see how well the abstractions do or don't work. I've
> seen another similar project suffer mightily from too many algorithms
> too early, so maybe I'm just paranoid.
>
> Anyway, long-term, I think lots of good algorithms is a right and
> proper goal for MLlib, myself. Consistent approaches, representations
> and APIs will make or break MLlib much more than having or not having
> a particular algorithm. With the plumbing in place, writing the algo
> is the fun easy part.
> --
> Sean Owen | Director, Data Science | London
>
>
> On Mon, Apr 21, 2014 at 4:39 PM, Aliaksei Litouka
> <aliaksei.lito...@gmail.com> wrote:
> > Hi, Spark developers.
> > Are there any plans for implementing new clustering algorithms in
> > MLLib? As far as I understand, current version of Spark ships with
> > only one clustering algorithm - K-Means. I want to contribute to
> > Spark and I'm thinking of adding more clustering algorithms - maybe
> > DBSCAN<http://en.wikipedia.org/wiki/DBSCAN>.
> > I can start working on it. Does anyone want to join me?