Thank you very much for the detailed answers. I can only agree that a good MLlib core is a higher priority than algorithms built on top of it. I'll check whether there is anything I can contribute to the core. I will also follow Nick Pentreath's recommendation to start a new GitHub project. Here is a link to the repository: https://github.com/alitouka/spark_dbscan . It is currently empty - I've just created it :)
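For anyone unfamiliar with the algorithm, here is a minimal single-machine sketch of DBSCAN's core idea in plain Python - just a reference for the semantics (core points, border points, noise), not the distributed implementation the project will need; the function and parameter names (`eps`, `min_pts`) are mine:

```python
# Minimal single-machine DBSCAN sketch (illustration only, not distributed).
# Returns one label per point: a cluster id >= 0, or -1 for noise.
from math import dist

def dbscan(points, eps, min_pts):
    labels = {}          # point index -> cluster id (-1 = noise)
    cluster_id = -1

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # noise for now; may become a border point later
            continue
        cluster_id += 1              # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j, -1) == -1:      # unvisited, or previously marked noise
                unvisited = j not in labels
                labels[j] = cluster_id
                if unvisited:
                    nbrs = neighbors(j)
                    if len(nbrs) >= min_pts: # j is also a core point: keep expanding
                        queue.extend(nbrs)
    return [labels[i] for i in range(len(points))]
```

The naive neighbor search here is O(n^2); the interesting part of a Spark version is precisely replacing it with a partitioned spatial index, which is where performance comparisons vs. MLlib K-Means would come in.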
2014-04-21 11:40 GMT-05:00 Nick Pentreath <nick.pentre...@gmail.com>:

> I am very much +1 on Sean's comment.
>
> I think the correct abstractions and API for Vectors, Matrices and
> distributed matrices (distributed row matrix, etc.) will, once bedded
> down and battle-tested in the wild, allow a whole lot of flexibility for
> developers of algorithms on top of MLlib core.
>
> This is true whether the algorithm finds itself in MLlib, MLBase, or a
> separate contrib project. Just as Spark core sometimes risks trying to
> please everybody by including the kitchen sink of Hadoop integration
> aspects or RDD operations (which is why a spark-contrib project may make
> a lot of sense), so too could an ml-contrib project hold many algorithms
> that are not core but are still of wide interest. These could include,
> for example, models that are still cutting edge and perhaps not as
> widely used in production yet, or specialist models that are of interest
> to a more niche group.
>
> scikit-learn is very tough about this, requiring a very high bar for
> including a new algorithm (many citations, dev support, proof of strong
> performance and wide demand). And this leads to a very high-quality code
> base in general.
>
> I'd say we should (if it hasn't been done already - I may have missed
> such a discussion) decide precisely what constitutes MLlib's "1.0.0"
> goals for algorithms. What we have in terms of clustering (K-Means||),
> linear models, decision trees and collaborative filtering is pretty much
> a good goal. Potentially the Random Forest implementation on top of the
> decision trees, and perhaps another form of recommendation model (such
> as the co-occurrence models, cf. Mahout's), could be candidates for
> inclusion. I'd also say any other optimization methods/procedures, in
> addition to SGD and L-BFGS, that are very strong and widely used for a
> variety of (distributed) ML problems could be candidates.
> And finally, things like useful utils, cross-validation and evaluation
> methods, etc.
>
> So I'd say by all means, please work on a new model such as DBSCAN. Put
> it in a new GitHub project, post some detailed performance comparisons
> vs. MLlib K-Means, and then, if it gets included in MLlib core in the
> future, it's pretty easy to do.
>
>
> On Mon, Apr 21, 2014 at 6:07 PM, Evan R. Sparks <evan.spa...@gmail.com>
> wrote:
>
> > While DBSCAN and others would be welcome contributions, I couldn't
> > agree more with Sean.
> >
> >
> > On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen <so...@cloudera.com> wrote:
> >
> > > Nobody asked me, and this is a comment on a broader question, not
> > > this one, but:
> > >
> > > In light of a number of recent items about adding more algorithms,
> > > I'll say that I personally think an explosion of algorithms should
> > > come after the MLlib "core" is more fully baked. I'm thinking of
> > > finishing out the changes to vectors and matrices, for example.
> > > Things are going to change significantly in the short term as people
> > > use the algorithms and see how well the abstractions do or don't
> > > work. I've seen another similar project suffer mightily from too
> > > many algorithms too early, so maybe I'm just paranoid.
> > >
> > > Anyway, long-term, I think lots of good algorithms is a right and
> > > proper goal for MLlib, myself. Consistent approaches,
> > > representations and APIs will make or break MLlib much more than
> > > having or not having a particular algorithm. With the plumbing in
> > > place, writing the algo is the fun, easy part.
> > > --
> > > Sean Owen | Director, Data Science | London
> > >
> > >
> > > On Mon, Apr 21, 2014 at 4:39 PM, Aliaksei Litouka
> > > <aliaksei.lito...@gmail.com> wrote:
> > > > Hi, Spark developers.
> > > > Are there any plans for implementing new clustering algorithms in
> > > > MLlib? As far as I understand, the current version of Spark ships
> > > > with only one clustering algorithm - K-Means. I want to contribute
> > > > to Spark, and I'm thinking of adding more clustering algorithms -
> > > > maybe DBSCAN <http://en.wikipedia.org/wiki/DBSCAN>.
> > > > I can start working on it. Does anyone want to join me?