[ https://issues.apache.org/jira/browse/SPARK-12626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188774#comment-15188774 ]
Nick Pentreath commented on SPARK-12626:
----------------------------------------

[~dbtsai] ok thanks - would like to take a look when it's ready.

> MLlib 2.0 Roadmap
> -----------------
>
>                 Key: SPARK-12626
>                 URL: https://issues.apache.org/jira/browse/SPARK-12626
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Xiangrui Meng
>            Priority: Blocker
>              Labels: roadmap
>
> This is a master list of the MLlib improvements we plan to have in Spark
> 2.0. Please view this list as a wish list rather than a concrete plan,
> because we don't have an accurate estimate of available resources. Due to
> limited review bandwidth, features appearing on this list will get higher
> priority during code review, but feel free to suggest new items in the
> comments. We are experimenting with this process. Your feedback would be
> greatly appreciated.
>
> h1. Instructions
>
> h2. For contributors:
>
> * Please read
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather
> than a medium/big feature. In our experience, learning the development
> process while tackling a big feature usually causes long delays in code
> review.
> * Never work silently. Let everyone know on the corresponding JIRA page when
> you start working on a feature, to avoid duplicate work. For small features,
> you don't need to wait to get the JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned
> first before coding and keep the ETA updated on the JIRA. If there is no
> activity on the JIRA page for a certain amount of time, the JIRA should be
> released to other contributors.
> * Do not claim more than three JIRAs at the same time. Try to finish them
> one after another.
> * Remember to add the `@Since("2.0.0")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code
> review greatly helps to improve others' code as well as yours.
>
> h2. For committers:
>
> * Try to break down big features into small, specific JIRA tasks and link
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate on medium/big features and track their progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on
> the JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial
> PRs, please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and
> documentation if applicable.
>
> h1. Roadmap (*WIP*)
>
> This is NOT [a complete list of MLlib JIRAs for
> 2.0|https://issues.apache.org/jira/issues/?filter=12334385]. We only include
> umbrella JIRAs and high-level tasks.
>
> Major efforts in this release:
> * `spark.ml`: Achieve feature parity for the `spark.ml` API relative to the
> `spark.mllib` API. This includes the Python API.
> * Linear algebra: Separate the linear algebra library out as a standalone
> project without a Spark dependency, to simplify production deployment.
> * Pipelines API: Complete critical improvements to the Pipelines API.
> * New features: As usual, we expect to expand the feature set of MLlib.
> However, we will prioritize API parity over new features. _New algorithms
> should be written for `spark.ml`, not `spark.mllib`._
>
> h2. Algorithms and performance
>
> * iteratively reweighted least squares (IRLS) for GLMs (SPARK-9835)
> * estimator interface for GLMs (SPARK-12811)
> * extended support for GLM model families and link functions in SparkR
> (SPARK-12566)
> * improved model summaries and stats via IRLS (SPARK-9837)
>
> Additional (maybe lower priority):
> * robust linear regression with Huber loss (SPARK-3181)
> * vector-free L-BFGS (SPARK-10078)
> * tree partitioning by features (SPARK-3717)
> * local linear algebra (SPARK-6442)
> * weighted instance support (SPARK-9610)
> ** random forest (SPARK-9478)
> ** GBT (SPARK-9612)
> * locality-sensitive hashing (LSH) (SPARK-5992)
> * deep learning (SPARK-5575)
> ** autoencoder (SPARK-10408)
> ** restricted Boltzmann machine (RBM) (SPARK-4251)
> ** convolutional neural network (stretch)
> * factorization machines (SPARK-7008)
> * distributed LU decomposition (SPARK-8514)
>
> h2. Statistics
>
> * bivariate statistics as UDAFs (SPARK-10385)
> * R-like statistics for GLMs (SPARK-9835)
> * sketch algorithms (cross-listed): approximate quantiles (SPARK-6761),
> count-min sketch (SPARK-6763), Bloom filter (SPARK-12818)
>
> h2. Pipeline API
>
> * pipeline persistence (SPARK-6725)
> ** trees (SPARK-11888)
> ** RFormula (SPARK-11891)
> ** MLC (SPARK-11871)
> ** PySpark (SPARK-11939) --> *This is now ready for people to take up
> subtasks!*
> * ML attribute API improvements (SPARK-8515)
> * predict single instance (SPARK-10413)
> * test Kaggle datasets (SPARK-9941)
>
> _There may be other design improvement efforts for Pipelines, to be listed
> here soon. See (SPARK-5874) for a list of possibilities._
>
> h2. Model persistence
>
> * PMML export
> ** naive Bayes (SPARK-8546)
> ** decision tree (SPARK-8542)
> * model save/load
> ** FPGrowth (SPARK-6724)
> ** PrefixSpan (SPARK-10386)
> * code generation
> ** decision tree and tree ensembles (SPARK-10387)
>
> h2. Data sources
>
> * public dataset loader (SPARK-10388)
>
> h2. Python API for ML
>
> The main goal of the Python API is feature parity with the Scala/Java API.
> You can find a complete list
> [here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks
> fall into the following categories:
> * pipeline persistence in PySpark (SPARK-11939)
> * Python API for missing methods (SPARK-11937)
> * Python API for new algorithms. Committers should create a JIRA for the
> Python API after merging a public feature in Scala/Java.
>
> h2. SparkR API for ML
>
> * support more families and link functions in SparkR::glm (SPARK-12566)
> * model summary with R-like statistics for GLMs (SPARK-9837)
> * support more algorithms (k-means (SPARK-13011), survival analysis
> (SPARK-13010), etc.)
>
> h2. Documentation
>
> * reorganize the user guide (SPARK-8517)
> * make example code in the user guide testable (SPARK-11337)
> * @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751)
> * fix param format in pydoc (SPARK-11219)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
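To make the IRLS item (SPARK-9835) concrete: IRLS fits a GLM by repeatedly solving a weighted least-squares (Newton) system built from the current predicted means. A minimal sketch in plain Python for logistic regression follows; the data and function name are illustrative, not the actual `spark.ml` implementation, which operates on distributed datasets.

```python
# Toy IRLS (Newton's method) for logistic regression: fit
# p(y=1|x) = sigmoid(b0 + b1*x) on a tiny, non-separable dataset.
import math

def irls_logistic(xs, ys, iters=25):
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Accumulate the gradient g and the 2x2 Hessian H of the
        # log-likelihood; w = mu*(1-mu) is the IRLS observation weight.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # predicted mean
            w = mu * (1.0 - mu)
            g0 += y - mu
            g1 += (y - mu) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the weighted least-squares system H * delta = g by hand
        # (Cramer's rule for the 2x2 case) and take the Newton step.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [0,   0,   1,   0,   1,   1,   0,   1]
b0, b1 = irls_logistic(xs, ys)
```

Each iteration is exactly a weighted least-squares solve, which is why the improved GLM summaries item (SPARK-9837) falls out of the same machinery: standard errors come from the final Hessian.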
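Likewise for the sketch algorithms item: a count-min sketch (SPARK-6763) answers approximate frequency queries in small, fixed space, never undercounting but possibly overcounting on hash collisions. A small illustrative version in plain Python; the hashing scheme and table sizes here are arbitrary choices for the sketch, not Spark's.

```python
# Illustrative count-min sketch: depth rows of width counters, one hash
# function per row; estimate = min over rows, an upper bound on the true count.
import hashlib

class CountMinSketch:
    def __init__(self, depth=5, width=272):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # Derive one hash per row by salting a stable digest with the row index.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # Never undercounts; may overcount when other items collide.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))
```

Merging two sketches is cell-wise addition of their tables, which is what makes the structure a natural fit for a distributed aggregation.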
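The Bloom filter item (SPARK-12818) is the membership-query counterpart: no false negatives, tunable false-positive rate. A compact illustrative version in plain Python, with bit-array size and hash scheme chosen arbitrarily for the sketch rather than taken from Spark.

```python
# Illustrative Bloom filter: k salted hashes set/check k bit positions.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def put(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True for every inserted item (no false negatives);
        # may occasionally be true for items never inserted.
        return all(self.bits[pos] for pos in self._positions(item))
```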