[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10324:
----------------------------------
    Priority: Blocker  (was: Critical)

> MLlib 1.6 Roadmap
> -----------------
>
>                 Key: SPARK-10324
>                 URL: https://issues.apache.org/jira/browse/SPARK-10324
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Blocker
>
> Following SPARK-8445, we created this master list for MLlib features we plan 
> to have in Spark 1.6. Please view this list as a wish list rather than a 
> concrete plan, because we don't have an accurate estimate of available 
> resources. Due to limited review bandwidth, features appearing on this list 
> will get higher priority during code review. But feel free to suggest new 
> items to the list in comments. We are experimenting with this process. Your 
> feedback would be greatly appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. In our experience, learning the development 
> process while working on a big feature usually causes long delays in code 
> review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on a feature. This avoids duplicate work. For small 
> features, you don't need to wait for the JIRA to be assigned to you.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("1.6.0")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add the "starter" label to starter tasks.
> * Provide a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if necessary.
> h1. Roadmap (WIP)
> This is NOT [a complete list of MLlib JIRAs for 
> 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include 
> umbrella JIRAs and high-level tasks.
> h2. Algorithms and performance
> * log-linear model for survival analysis (SPARK-8518)
> * normal equation approach for linear regression (SPARK-9834)
> * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
> * robust linear regression with Huber loss (SPARK-3181)
> * vector-free L-BFGS (SPARK-10078)
> * tree partition by features (SPARK-3717)
> * bisecting k-means (SPARK-6517)
> * weighted instance support (SPARK-9610)
> ** logistic regression (SPARK-7685)
> ** linear regression (SPARK-9642)
> ** random forest (SPARK-9478)
> * locality sensitive hashing (LSH) (SPARK-5992)
> * deep learning (SPARK-2352)
> ** autoencoder (SPARK-4288)
> ** restricted Boltzmann machine (RBM) (SPARK-4251)
> ** convolutional neural network (stretch)
> * factorization machine (SPARK-7008)
> * local linear algebra (SPARK-6442)
> * distributed LU decomposition (SPARK-8514)
> h2. Statistics
> * univariate statistics as UDAFs (SPARK-10384)
> * bivariate statistics as UDAFs (SPARK-10385)
> * R-like statistics for GLMs (SPARK-9835)
> * online hypothesis testing (SPARK-3147)
> h2. Pipeline API
> * pipeline persistence (SPARK-6725)
> * ML attribute API improvements (SPARK-8515)
> * feature transformers (SPARK-9930)
> ** feature interaction (SPARK-9698)
> ** SQL transformer (SPARK-8345)
> ** ??
> * test Kaggle datasets (SPARK-9941)
> h2. Model persistence
> * PMML export
> ** naive Bayes (SPARK-8546)
> ** decision tree (SPARK-8542)
> * model save/load
> ** FPGrowth (SPARK-6724)
> ** PrefixSpan (SPARK-10386)
> * code generation
> ** decision tree and tree ensembles (SPARK-10387)
> h2. Data sources
> * LIBSVM data source (SPARK-10117)
> * public dataset loader (SPARK-10388)
> h2. Python API for ML
> The main goal of the Python API is feature parity with the Scala/Java API.
> * Python API for new algorithms
> * Python API for missing methods
> h2. SparkR API for ML
> * support more families and link functions in SparkR::glm (SPARK-9838, 
> SPARK-9839, SPARK-9840)
> * better R formula support (SPARK-9681)
> * model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)
> h2. Documentation
> * re-organize user guide (SPARK-8517)
> * @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751)
> * automatically test example code in user guide (SPARK-10382)


