Thank you very much for the detailed answers. I can only agree that a good MLlib core is a higher priority than algorithms built on top of it. I'll check whether there is anything I can contribute to the core. I will also follow Nick Pentreath's recommendation to start a new GitHub project. Here is a link to the repository: https://github.com/alitouka/spark_dbscan . It is currently empty - I've just created it :)
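For anyone unfamiliar with the algorithm, here is a minimal single-machine sketch of DBSCAN's core idea in plain Python - just a reference for the semantics (core points, border points, noise), not the distributed implementation the project will need; the function and parameter names (`eps`, `min_pts`) are mine:

```python
# Minimal single-machine DBSCAN sketch (illustration only, not distributed).
# Returns one label per point: a cluster id >= 0, or -1 for noise.
from math import dist

def dbscan(points, eps, min_pts):
    labels = {}          # point index -> cluster id (-1 = noise)
    cluster_id = -1

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # noise for now; may become a border point later
            continue
        cluster_id += 1              # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j, -1) == -1:      # unvisited, or previously marked noise
                unvisited = j not in labels
                labels[j] = cluster_id
                if unvisited:
                    nbrs = neighbors(j)
                    if len(nbrs) >= min_pts: # j is also a core point: keep expanding
                        queue.extend(nbrs)
    return [labels[i] for i in range(len(points))]
```

The naive neighbor search here is O(n^2); the interesting part of a Spark version is precisely replacing it with a partitioned spatial index, which is where performance comparisons vs. MLlib K-Means would come in.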
2014-04-21 11:40 GMT-05:00 Nick Pentreath <nick.pentre...@gmail.com>:

> I am very much +1 on Sean's comment.
>
> I think the correct abstractions and API for Vectors, Matrices and
> distributed matrices (distributed row matrix, etc.) will, once bedded
> down and battle-tested in the wild, allow a whole lot of flexibility for
> developers of algorithms on top of MLlib core.
>
> This is true whether the algorithm finds itself in MLlib, MLBase, or a
> separate contrib project. Just as Spark core sometimes risks trying to
> please everybody by including the kitchen sink of Hadoop integration
> aspects or RDD operations (which is why a spark-contrib project may make
> a lot of sense), so too could an ml-contrib project hold many algorithms
> that are not core but are still of wide interest. These could include,
> for example, models that are still cutting edge and perhaps not as
> widely used in production yet, or specialist models that are of interest
> to a more niche group.
>
> scikit-learn is very tough about this, requiring a very high bar for
> including a new algorithm (many citations, dev support, proof of strong
> performance and wide demand). And this leads to a very high-quality code
> base in general.
>
> I'd say we should (if it hasn't been done already - I may have missed
> such a discussion) decide precisely what constitutes MLlib's "1.0.0"
> goals for algorithms. What we have in terms of clustering (K-Means||),
> linear models, decision trees and collaborative filtering is pretty much
> a good goal. Potentially the Random Forest implementation on top of the
> decision trees, and perhaps another form of recommendation model (such
> as the co-occurrence models, cf. Mahout's), could be candidates for
> inclusion. I'd also say any other optimization methods/procedures, in
> addition to SGD and L-BFGS, that are very strong and widely used for a
> variety of (distributed) ML problems could be candidates.
> And finally, things like useful utils, cross-validation and evaluation
> methods, etc.
>
> So I'd say by all means, please work on a new model such as DBSCAN. Put
> it in a new GitHub project, post some detailed performance comparisons
> vs. MLlib K-Means, and then, if it gets included in MLlib core in the
> future, it's pretty easy to do.
>
>
> On Mon, Apr 21, 2014 at 6:07 PM, Evan R. Sparks <evan.spa...@gmail.com>
> wrote:
>
> > While DBSCAN and others would be welcome contributions, I couldn't
> > agree more with Sean.
> >
> >
> > On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen <so...@cloudera.com> wrote:
> >
> > > Nobody asked me, and this is a comment on a broader question, not
> > > this one, but:
> > >
> > > In light of a number of recent items about adding more algorithms,
> > > I'll say that I personally think an explosion of algorithms should
> > > come after the MLlib "core" is more fully baked. I'm thinking of
> > > finishing out the changes to vectors and matrices, for example.
> > > Things are going to change significantly in the short term as people
> > > use the algorithms and see how well the abstractions do or don't
> > > work. I've seen another similar project suffer mightily from too
> > > many algorithms too early, so maybe I'm just paranoid.
> > >
> > > Anyway, long-term, I think lots of good algorithms is a right and
> > > proper goal for MLlib, myself. Consistent approaches,
> > > representations and APIs will make or break MLlib much more than
> > > having or not having a particular algorithm. With the plumbing in
> > > place, writing the algo is the fun, easy part.
> > > --
> > > Sean Owen | Director, Data Science | London
> > >
> > >
> > > On Mon, Apr 21, 2014 at 4:39 PM, Aliaksei Litouka
> > > <aliaksei.lito...@gmail.com> wrote:
> > > > Hi, Spark developers.
> > > > Are there any plans for implementing new clustering algorithms in
> > > > MLlib? As far as I understand, the current version of Spark ships
> > > > with only one clustering algorithm - K-Means. I want to contribute
> > > > to Spark, and I'm thinking of adding more clustering algorithms -
> > > > maybe DBSCAN <http://en.wikipedia.org/wiki/DBSCAN>.
> > > > I can start working on it. Does anyone want to join me?