This thread is split off from the "Feedback on MLlib roadmap process
proposal" thread to discuss the high-level mission and goals for
MLlib.  I hope this thread will collect feedback and ideas; it does not
necessarily need to lead to big decisions.

Copying from the previous thread:

*Seth:*
"""
I would love to hear some discussion on the higher level goal of Spark
MLlib (if this derails the original discussion, please let me know and we
can discuss in another thread). The roadmap does contain specific items
that help to convey some of this (ML parity with MLlib, model persistence,
etc...), but I'm interested in what the "mission" of Spark MLlib is. We
often see PRs for brand-new algorithms, which are sometimes rejected and
sometimes not. Do we aim to keep implementing more and more algorithms? Or
is our focus really, now that we have a reasonable library of algorithms,
to simply make the existing ones faster/better/more robust? Should we aim
to make interfaces that are easily extended, so that developers can
implement their own custom code (e.g. custom optimization libraries), or do
we want to restrict things to out-of-the-box algorithms? Should we focus on
more flexible, general abstractions like distributed linear algebra?

I was not involved in the project in the early days of MLlib when this
discussion may have happened, but I think it would be useful to either
revisit it or restate it here for some of the newer developers.
"""

*Mingjie:*
"""
+1 general abstractions like distributed linear algebra.
"""


I'll add my thoughts, starting with our past *trajectory*:
* Initially, MLlib was mainly trying to build a set of core algorithms.
* Two years ago, the big effort was adding Pipelines.
* In the last year, big efforts have been around completing Pipelines and
making the library more robust.

I agree with Seth that a few *immediate goals* are very clear:
* feature parity for the DataFrame-based API
* completing and improving testing for model persistence (a usage sketch
follows this list)
* Python, R parity
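For readers newer to the DataFrame-based API, "model persistence" here
means the save/load support for Pipelines.  A minimal sketch of how it is
used today (the toy data and the /tmp path are placeholders, just for
illustration):

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("persistence-sketch").getOrCreate()

// Toy training data: (id, text, label).
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// A small text-classification Pipeline: tokenize -> hash -> logistic regression.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(training)

// Save the fitted model and load it back; the path is a placeholder.
model.write.overwrite().save("/tmp/spark-lr-model")
val sameModel = PipelineModel.load("/tmp/spark-lr-model")

Improving testing here would mean covering save/load round-trips like this
one for every algorithm and every language API.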

*In the future*, it's harder to say, but if I had to pick my top 2 items,
I'd list:

*(1) Making MLlib more extensible*
It will not be feasible to support a huge number of algorithms, so allowing
users to customize their ML-on-Spark workflows will be critical.  This is
IMO the most important thing we could do for MLlib.
Part of this could be building a healthy ecosystem of Spark Packages, which
will require making it easier for users to write their own algorithms and
packages.  Part of this could be allowing users to customize existing
algorithms with custom loss functions, etc.
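To make "extensible" concrete: the Pipelines API already lets third parties
write their own stages, and lowering that barrier further is what I have in
mind.  A minimal sketch of a user-defined stage (the class and its behavior
are hypothetical, purely for illustration):

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Hypothetical user-defined stage that lower-cases a string column.
// Extending UnaryTransformer makes it usable in any Pipeline.
class Lowercaser(override val uid: String)
    extends UnaryTransformer[String, String, Lowercaser] {

  def this() = this(Identifiable.randomUID("lowercaser"))

  override protected def createTransformFunc: String => String =
    _.toLowerCase

  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input must be string, got $inputType")

  override protected def outputDataType: DataType = StringType
}

// Usage, like any built-in stage:
//   new Lowercaser().setInputCol("text").setOutputCol("textLower")

Simple Transformers like this are easy today, but writing a full Estimator
with Params, persistence, and Python bindings takes much more boilerplate;
that gap is where I would focus.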

*(2) Consistent improvements to core algorithms*
A less exciting but still very important item will be steadily improving
the core set of algorithms in MLlib.  This could mean speed, scaling,
robustness, and usability for the few algorithms that cover 90% of use
cases.

There are plenty of other possibilities, and it will be great to hear the
community's thoughts!

Thanks,
Joseph

-- 

Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.
http://databricks.com
