[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

Joseph K. Bradley (JIRA) Fri, 09 Dec 2016 15:39:01 -0800

    [ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736685#comment-15736685 ]


Joseph K. Bradley commented on SPARK-15581:
-------------------------------------------

I like a lot of the points made here.  A few thoughts on each:

* Clearly messaging what we WILL get done + limiting promises based on reviewer 
bandwidth
** I'll try to draft some ideas for how to do this.  I'd really like to make 
good use of JIRA fields like Target Version, Priority, and labels in order to 
make it easy to write searches to help contributors explore JIRA.

* Umbrella vs. specific JIRAs.  Broad efforts vs. targeted efforts.
** [~holdenk] I like umbrellas for organization, coverage, and coordination, 
and I agree with you that we should not get rid of them---and that the answer 
is to be stricter about specifying Priority.

* Short-term (next minor release) vs long-term (next major release) efforts
** I worry about promising specific JIRAs by the next major release because 
those JIRAs could easily pile up to make the final list huge.  We will have to 
limit those to critical or breaking changes.

* Open JIRAs not on roadmaps
** The roadmap could have links to tags to help users find and participate in 
these conversations.

* SparkR: I don't have full solutions but do have a few concrete suggestions:
** Committers (myself included) need to be more diligent about creating 
follow-up tasks.  When any new API is added in Scala, the committer or 
contributor should create follow-up tasks for Python, R, and documentation, and 
those should be targeted at the same release.  I.e., when a committer agrees to 
shepherd a feature, they agree to shepherd all language APIs and docs.
** As for how to make R easier to work with, I welcome your suggestions!
** Supporting Pipelines and advanced use cases: There really needs to be more 
design discussion around SparkR.  [~felixcheung] would you be interested in 
leading some discussion?  I'm envisioning something similar to what was done a 
while back for Pipelines in Scala/Java/Python, where we consider several use 
cases of MLlib: fitting a single model, creating and tuning a complex Pipeline, 
and working with multiple languages.  That should help inform what APIs should 
look like in SparkR.
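
As a rough illustration of the follow-up rule suggested above (Python, R, and documentation tasks targeted at the same release as the Scala API), here is a minimal sketch. Everything in it is hypothetical: the `FollowUpTask` type, the `follow_up_tasks` helper, the feature name, and the mapping to JIRA component names are assumptions made for illustration, not part of any existing Spark or JIRA tooling.

```python
from dataclasses import dataclass

# Assumption: each follow-up kind maps to one JIRA component name.
FOLLOW_UPS = (
    ("Python API", "PySpark"),
    ("R API", "SparkR"),
    ("documentation", "Documentation"),
)


@dataclass(frozen=True)
class FollowUpTask:
    """Hypothetical record describing one follow-up JIRA to file."""
    summary: str
    component: str
    target_version: str


def follow_up_tasks(feature, target_version):
    """List the follow-ups for a new Scala API, all targeted at the same release."""
    return [
        FollowUpTask(f"{feature}: {kind}", component, target_version)
        for kind, component in FOLLOW_UPS
    ]


# "NewEstimator" is a placeholder feature name, not a real MLlib class.
for task in follow_up_tasks("NewEstimator", "2.1.0"):
    print(task)
```

The only point of the sketch is that the target version is passed once and shared by every follow-up, so the language APIs and docs cannot drift to a later release by accident.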

[~sethah] Thanks for aggregating that list of issues!  I do think it's a pretty 
ambitious list for one release, but I'll definitely use it to help identify 
items I'd like to mark myself down for shepherding in the next release.
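
To make the idea of searches built on Target Version, Priority, and labels a bit more concrete, here is a minimal sketch of how such a search link could be assembled. The JQL field names and the search URL prefix follow the links in the issue description below. The `build_jql` and `search_url` helpers and the `starter` label filter are assumptions made for illustration, not an existing tool.

```python
from urllib.parse import quote

# Search URL prefix as used by the links in the issue description.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/?jql="


def build_jql(project, components, target_version, label=None):
    """Assemble a JQL string filtered by component, target version, and label."""
    clauses = [
        f"project = {project}",
        f"component in ({', '.join(components)})",
        f'"Target Version/s" = {target_version}',
    ]
    if label is not None:
        clauses.append(f"labels = {label}")
    # Sort so the highest-priority issues come first.
    return " AND ".join(clauses) + " ORDER BY priority DESC"


def search_url(jql):
    """Percent-encode the JQL so it can be pasted as a link."""
    return JIRA_SEARCH + quote(jql, safe="")


jql = build_jql("SPARK", ["ML", "MLlib"], "2.1.0", label="starter")
print(jql)
print(search_url(jql))
```

A roadmap page could then carry a handful of such links (by component, by label, by target version) instead of a hand-maintained list of JIRAs.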

> MLlib 2.1 Roadmap
> -----------------
>
>                 Key: SPARK-15581
>                 URL: https://issues.apache.org/jira/browse/SPARK-15581
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Blocker
>              Labels: roadmap
>             Fix For: 2.1.0
>
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify 
> persistence) (SPARK-15574)
> * Feature parity: The main goal of the Python API is to have feature parity 
> with the Scala/Java API. You can find a [complete list here| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)%20AND%20"Target%20Version%2Fs"%20%3D%202.1.0%20ORDER%20BY%20priority%20DESC].
>  The tasks fall into two major categories:
> ** Python API for missing methods (SPARK-14813)
> ** Python API for new algorithms. Committers should create a JIRA for the 
> Python API after merging a public feature in Scala/Java.
> h2. SparkR
> * Improve R formula support and implementation (SPARK-15540)
> * Various SparkR ML API and usability improvements
> ** Note: No linked JIRA yet, but need to create an umbrella once more issues 
> are collected.
> * Wrap more MLlib algorithms (SPARK-16442)
> * Release SparkR on CRAN [SPARK-15799]
> h2. Pipeline API
> * Usability: Automatic feature preprocessing [SPARK-11106]
> * ML attribute API improvements (SPARK-8515)
> * test Kaggle datasets (SPARK-9941)
> * See (SPARK-5874) for a list of other possibilities
> h2. Algorithms and performance
> * Trees & ensembles scaling & speed (SPARK-14045), (SPARK-14046), 
> (SPARK-14047)
> * Locality sensitive hashing (LSH) (SPARK-5992)
> * Similarity search / nearest neighbors (SPARK-2336)
> Additional (may be lower priority):
> * robust linear regression with Huber loss (SPARK-3181)
> * vector-free L-BFGS (SPARK-10078)
> * tree partition by features (SPARK-3717)
> * local linear algebra (SPARK-6442)
> * weighted instance support (SPARK-9610)
> ** random forest (SPARK-9478)
> ** GBT (SPARK-9612)
> * deep learning (SPARK-5575)
> ** autoencoder (SPARK-10408)
> ** restricted Boltzmann machine (RBM) (SPARK-4251)
> ** convolutional neural network (stretch)
> * factorization machine (SPARK-7008)
> * distributed LU decomposition (SPARK-8514)
> h2. Other
> * Infra
> ** Testing for example code (SPARK-12347)
> ** Remove breeze from dependencies (SPARK-15575)
> * public dataset loader (SPARK-10388)
> * Documentation: improve organization of user guide (SPARK-8517)
> * Python Documentation: expose default values of params in some way 
> (SPARK-15130)


