I am very much +1 on Sean's comment.

I think the right abstractions and APIs for vectors, matrices, and
distributed matrices (distributed row matrix, etc.) will, once bedded down
and battle-tested in the wild, give developers building algorithms on top
of MLlib core a whole lot of flexibility.
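
For concreteness, here is a rough, minimal sketch of what those
abstractions look like at the moment (local vectors and matrices plus the
distributed RowMatrix, under org.apache.spark.mllib.linalg); names and
signatures may well shift as the API beds down, and the values below are
made up purely for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val sc = new SparkContext("local", "linalg-sketch")

    // Local vectors: dense, and sparse as (size, indices, values).
    val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
    val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

    // A local 2x2 dense matrix (values given in column-major order).
    val local = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))

    // A distributed row matrix built from an RDD of local vectors.
    val rowMatrix = new RowMatrix(sc.parallelize(Seq(dense, sparse)))
    println(rowMatrix.numRows() + " x " + rowMatrix.numCols())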

This is true whether an algorithm ends up in MLlib, in MLBase, or in a
separate contrib project. Spark core sometimes risks trying to please
everybody by accumulating a kitchen sink of Hadoop integrations and RDD
operations, which is why a spark-contrib project may make a lot of sense;
in the same way, an ml-contrib project could hold many algorithms that are
not core but still of wide interest. This could include, for example,
models that are still cutting edge and perhaps not yet widely used in
production, or specialist models that interest a more niche group.

scikit-learn is very strict about this, setting a high bar for including a
new algorithm (many citations, developer support, and proof of strong
performance and wide demand), and this leads to a very high-quality code
base in general.

I'd say we should decide precisely what constitutes MLlib's "1.0.0" goals
for algorithms (if that hasn't been done already; I may have missed such a
discussion). What we have in terms of clustering (K-Means||), linear
models, decision trees, and collaborative filtering is pretty much a good
target. A Random Forest implementation on top of the decision trees, and
perhaps another form of recommendation model (such as the co-occurrence
models, cf. Mahout's), could be candidates for inclusion. I'd also say any
other optimization methods/procedures, beyond SGD and L-BFGS, that are
strong and widely used for a variety of (distributed) ML problems could be
candidates. And finally, things like useful utils, cross-validation, and
evaluation methods.

So I'd say by all means, please work on a new model such as DBSCAN. Put it
in a new GitHub project, post some detailed performance comparisons
against MLlib's K-Means, and then, if it is later accepted into MLlib
core, the move will be pretty easy to make.
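
As a sketch of what the MLlib side of that comparison could look like (the
input path, k, and iteration count below are placeholders, not
recommendations):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val sc = new SparkContext("local[4]", "kmeans-baseline")

    // Parse whitespace-separated numeric points; the path is hypothetical.
    val points = sc.textFile("hdfs:///tmp/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // Train the MLlib K-Means baseline (k and maxIterations are arbitrary).
    val model = KMeans.train(points, 10, 20)

    // Within-set sum of squared errors: one number to report alongside
    // wall-clock timings when comparing against a DBSCAN implementation.
    println("WSSSE = " + model.computeCost(points))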


On Mon, Apr 21, 2014 at 6:07 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

> While DBSCAN and others would be welcome contributions, I couldn't agree
> more with Sean.
>
> On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen <so...@cloudera.com> wrote:
>
> > Nobody asked me, and this is a comment on a broader question, not this
> > one, but:
> >
> > In light of a number of recent items about adding more algorithms,
> > I'll say that I personally think an explosion of algorithms should
> > come after the MLlib "core" is more fully baked. I'm thinking of
> > finishing out the changes to vectors and matrices, for example. Things
> > are going to change significantly in the short term as people use the
> > algorithms and see how well the abstractions do or don't work. I've
> > seen another similar project suffer mightily from too many algorithms
> > too early, so maybe I'm just paranoid.
> >
> > Anyway, long-term, I think having lots of good algorithms is a right
> > and proper goal for MLlib, myself. Consistent approaches, representations
> > and APIs will make or break MLlib much more than having or not having
> > a particular algorithm. With the plumbing in place, writing the algo
> > is the fun easy part.
> > --
> > Sean Owen | Director, Data Science | London
> >
> >
> > On Mon, Apr 21, 2014 at 4:39 PM, Aliaksei Litouka
> > <aliaksei.lito...@gmail.com> wrote:
> > > Hi, Spark developers.
> > > Are there any plans for implementing new clustering algorithms in
> > > MLlib? As far as I understand, the current version of Spark ships
> > > with only one clustering algorithm - K-Means. I want to contribute
> > > to Spark and I'm thinking of adding more clustering algorithms -
> > > maybe DBSCAN <http://en.wikipedia.org/wiki/DBSCAN>.
> > > I can start working on it. Does anyone want to join me?
> >
>
