[ https://issues.apache.org/jira/browse/SPARK-10324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng updated SPARK-10324:
----------------------------------
    Priority: Blocker  (was: Critical)

> MLlib 1.6 Roadmap
> -----------------
>
>                 Key: SPARK-10324
>                 URL: https://issues.apache.org/jira/browse/SPARK-10324
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Blocker
>
> Following SPARK-8445, we created this master list for MLlib features we plan
> to have in Spark 1.6. Please view this list as a wish list rather than a
> concrete plan, because we don't have an accurate estimate of available
> resources. Due to limited review bandwidth, features appearing on this list
> will get higher priority during code review. But feel free to suggest new
> items to the list in comments. We are experimenting with this process. Your
> feedback would be greatly appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather
> than a medium/big feature. Based on our experience, learning the development
> process while working on a big feature usually causes long delays in code
> review.
> * Never work silently. Let everyone know on the corresponding JIRA page when
> you start working on some features. This is to avoid duplicate work. For
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned
> first before coding and keep the ETA updated on the JIRA. If there is no
> activity on the JIRA page for a certain amount of time, the JIRA should be
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one
> after another.
> * Remember to add the `@Since("1.6.0")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code
> review greatly helps improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link
> them properly.
> * Add "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs,
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and
> documentation if necessary.
> h1. Roadmap (WIP)
> This is NOT [a complete list of MLlib JIRAs for
> 1.6|https://issues.apache.org/jira/issues/?filter=12333208]. We only include
> umbrella JIRAs and high-level tasks.
> h2. Algorithms and performance
> * log-linear model for survival analysis (SPARK-8518)
> * normal equation approach for linear regression (SPARK-9834)
> * iteratively re-weighted least squares (IRLS) for GLMs (SPARK-9835)
> * robust linear regression with Huber loss (SPARK-3181)
> * vector-free L-BFGS (SPARK-10078)
> * tree partition by features (SPARK-3717)
> * bisecting k-means (SPARK-6517)
> * weighted instance support (SPARK-9610)
> ** logistic regression (SPARK-7685)
> ** linear regression (SPARK-9642)
> ** random forest (SPARK-9478)
> * locality sensitive hashing (LSH) (SPARK-5992)
> * deep learning (SPARK-2352)
> ** autoencoder (SPARK-4288)
> ** restricted Boltzmann machine (RBM) (SPARK-4251)
> ** convolutional neural network (stretch)
> * factorization machine (SPARK-7008)
> * local linear algebra (SPARK-6442)
> * distributed LU decomposition (SPARK-8514)
> h2. Statistics
> * univariate statistics as UDAFs (SPARK-10384)
> * bivariate statistics as UDAFs (SPARK-10385)
> * R-like statistics for GLMs (SPARK-9835)
> * online hypothesis testing (SPARK-3147)
> h2. Pipeline API
> * pipeline persistence (SPARK-6725)
> * ML attribute API improvements (SPARK-8515)
> * feature transformers (SPARK-9930)
> ** feature interaction (SPARK-9698)
> ** SQL transformer (SPARK-8345)
> ** ??
> * test Kaggle datasets (SPARK-9941)
> h2. Model persistence
> * PMML export
> ** naive Bayes (SPARK-8546)
> ** decision tree (SPARK-8542)
> * model save/load
> ** FPGrowth (SPARK-6724)
> ** PrefixSpan (SPARK-10386)
> * code generation
> ** decision tree and tree ensembles (SPARK-10387)
> h2. Data sources
> * LIBSVM data source (SPARK-10117)
> * public dataset loader (SPARK-10388)
> h2. Python API for ML
> The main goal of the Python API is to have feature parity with the
> Scala/Java API.
> * Python API for new algorithms
> * Python API for missing methods
> h2. SparkR API for ML
> * support more families and link functions in SparkR::glm (SPARK-9838,
> SPARK-9839, SPARK-9840)
> * better R formula support (SPARK-9681)
> * model summary with R-like statistics for GLMs (SPARK-9836, SPARK-9837)
> h2. Documentation
> * re-organize user guide (SPARK-8517)
> * @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751)
> * automatically test example code in user guide (SPARK-10382)