[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap

Xiangrui Meng (JIRA) Tue, 07 Jul 2015 08:40:40 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xiangrui Meng updated SPARK-8445:
---------------------------------
    Description: 
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter 
task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0]
 rather than a medium/big feature. Based on our experience, learning the 
development process while working on a big feature usually causes long delays 
in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on a feature. This is to avoid duplicate work. For small 
features, you don't need to wait for the JIRA to be assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add the "starter" label to starter tasks.
* Provide a rough estimate for medium/big features and track progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC].
 We only include umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* LDA improvements (SPARK-5572)
* Log-linear model for survival analysis (SPARK-8518)
* Improve GLM's scalability on number of features (SPARK-8520)
* Tree and ensembles: Move + cleanup code (SPARK-7131), provide class 
probabilities (SPARK-3727), feature importance (SPARK-5133)
* Improve GMM scalability and stability (SPARK-7206)
* Frequent pattern mining improvements (SPARK-7211)
* R-like stats for ML models (SPARK-7674)
* Generalize classification threshold to multiclass (SPARK-8069)
* A/B testing (SPARK-3147)

h2. Pipeline API

* more feature transformers (SPARK-8521)
* k-means (SPARK-7879)
* naive Bayes (SPARK-8600)
* TrainValidationSplit for tuning (SPARK-8484)
* Isotonic regression (SPARK-8671)

h2. Model persistence

* more PMML export (SPARK-8545)
* model save/load (SPARK-4587)
* pipeline persistence (SPARK-6725)

h2. Python API for ML

* List of issues identified during Spark 1.4 QA (SPARK-7536)
* Python API for streaming ML algorithms (SPARK-3258)
* Add missing model methods (SPARK-8633)

h2. SparkR API for ML

* ML Pipeline API in SparkR (SPARK-6805)
* model.matrix for DataFrames (SPARK-6823)

h2. Documentation

* [Search for documentation improvements | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]

  was:
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter 
task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0]
 rather than a medium/big feature. Based on our experience, learning the 
development process while working on a big feature usually causes long delays 
in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on a feature. This is to avoid duplicate work. For small 
features, you don't need to wait for the JIRA to be assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add the "starter" label to starter tasks.
* Provide a rough estimate for medium/big features and track progress.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC].
 We only include umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* LDA improvements (SPARK-5572)
* Log-linear model for survival analysis (SPARK-8518)
* Improve GLM's scalability on number of features (SPARK-8520)
* Tree and ensembles: Move + cleanup code (SPARK-7131), provide class 
probabilities (SPARK-3727), feature importance (SPARK-5133)
* Improve GMM scalability and stability (SPARK-7206)
* Frequent pattern mining improvements (SPARK-7211)
* R-like stats for ML models (SPARK-7674)
* Generalize classification threshold to multiclass (SPARK-8069)
* A/B testing (SPARK-3147)

h2. Pipeline API

* more feature transformers (SPARK-8521)
* k-means (SPARK-7879)
* naive Bayes (SPARK-8600)
* TrainValidationSplit for tuning (SPARK-8484)
* Isotonic regression (SPARK-8671)

h2. Model persistence

* more PMML export (SPARK-8545)
* model save/load (SPARK-4587)
* pipeline persistence (SPARK-6725)

h2. Python API for ML

* List of issues identified during Spark 1.4 QA (SPARK-7536)
* Python API for streaming ML algorithms (SPARK-3258)
* Add missing model methods (SPARK-8633)

h2. SparkR API for ML

* ML Pipeline API in SparkR (SPARK-6805)
* model.matrix for DataFrames (SPARK-6823)

h2. Documentation

* [Search for documentation improvements | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]


> MLlib 1.5 Roadmap
> -----------------
>
>                 Key: SPARK-8445
>                 URL: https://issues.apache.org/jira/browse/SPARK-8445
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>    Affects Versions: 1.5.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> We expect to see many MLlib contributors for the 1.5 release. To scale out 
> the development, we created this master list for MLlib features we plan to 
> have in Spark 1.5. Please view this list as a wish list rather than a 
> concrete plan, because we don't have an accurate estimate of available 
> resources. Due to limited review bandwidth, features appearing on this list 
> will get higher priority during code review. But feel free to suggest new 
> items to the list in comments. We are experimenting with this process. Your 
> feedback would be greatly appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter 
> task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0]
>  rather than a medium/big feature. Based on our experience, learning the 
> development process while working on a big feature usually causes long 
> delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on a feature. This is to avoid duplicate work. For 
> small features, you don't need to wait for the JIRA to be assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add the "starter" label to starter tasks.
> * Provide a rough estimate for medium/big features and track progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if necessary.
> h1. Roadmap (WIP)
> This is NOT [a complete list of MLlib JIRAs for 
> 1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC].
>  We only include umbrella JIRAs and high-level tasks.
> h2. Algorithms and performance
> * LDA improvements (SPARK-5572)
> * Log-linear model for survival analysis (SPARK-8518)
> * Improve GLM's scalability on number of features (SPARK-8520)
> * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class 
> probabilities (SPARK-3727), feature importance (SPARK-5133)
> * Improve GMM scalability and stability (SPARK-7206)
> * Frequent pattern mining improvements (SPARK-7211)
> * R-like stats for ML models (SPARK-7674)
> * Generalize classification threshold to multiclass (SPARK-8069)
> * A/B testing (SPARK-3147)
> h2. Pipeline API
> * more feature transformers (SPARK-8521)
> * k-means (SPARK-7879)
> * naive Bayes (SPARK-8600)
> * TrainValidationSplit for tuning (SPARK-8484)
> * Isotonic regression (SPARK-8671)
> h2. Model persistence
> * more PMML export (SPARK-8545)
> * model save/load (SPARK-4587)
> * pipeline persistence (SPARK-6725)
> h2. Python API for ML
> * List of issues identified during Spark 1.4 QA (SPARK-7536)
> * Python API for streaming ML algorithms (SPARK-3258)
> * Add missing model methods (SPARK-8633)
> h2. SparkR API for ML
> * ML Pipeline API in SparkR (SPARK-6805)
> * model.matrix for DataFrames (SPARK-6823)
> h2. Documentation
> * [Search for documentation improvements | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
