[ https://issues.apache.org/jira/browse/SPARK-12626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188774#comment-15188774 ]
Nick Pentreath commented on SPARK-12626:
----------------------------------------

[~dbtsai] ok thanks - would like to take a look when it's ready.

> MLlib 2.0 Roadmap
> -----------------
>
>                 Key: SPARK-12626
>                 URL: https://issues.apache.org/jira/browse/SPARK-12626
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Xiangrui Meng
>            Priority: Blocker
>              Labels: roadmap
>
> This is a master list of the MLlib improvements we plan to have in Spark
> 2.0. Please view this list as a wish list rather than a concrete plan,
> because we don't have an accurate estimate of available resources. Due to
> limited review bandwidth, features appearing on this list will get higher
> priority during code review, but feel free to suggest new items in the
> comments. We are experimenting with this process. Your feedback would be
> greatly appreciated.
>
> h1. Instructions
>
> h2. For contributors:
>
> * Please read
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather
> than a medium/big feature. In our experience, learning the development
> process while tackling a big feature usually causes long delays in code
> review.
> * Never work silently. Let everyone know on the corresponding JIRA page when
> you start working on a feature, to avoid duplicate work. For small features,
> you don't need to wait to get the JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned
> first before coding and keep the ETA updated on the JIRA. If there is no
> activity on the JIRA page for a certain amount of time, the JIRA should be
> released to other contributors.
> * Do not claim more than three JIRAs at the same time. Try to finish them
> one after another.
> * Remember to add the `@Since("2.0.0")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code
> review greatly helps to improve others' code as well as yours.
>
> h2. For committers:
>
> * Try to break down big features into small, specific JIRA tasks and link
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate on medium/big features and track their progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on
> the JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial
> PRs, please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and
> documentation if applicable.
>
> h1. Roadmap (*WIP*)
>
> This is NOT [a complete list of MLlib JIRAs for
> 2.0|https://issues.apache.org/jira/issues/?filter=12334385]. We only include
> umbrella JIRAs and high-level tasks.
>
> Major efforts in this release:
> * `spark.ml`: Achieve feature parity for the `spark.ml` API relative to the
> `spark.mllib` API. This includes the Python API.
> * Linear algebra: Separate the linear algebra library out as a standalone
> project without a Spark dependency, to simplify production deployment.
> * Pipelines API: Complete critical improvements to the Pipelines API.
> * New features: As usual, we expect to expand the feature set of MLlib.
> However, we will prioritize API parity over new features. _New algorithms
> should be written for `spark.ml`, not `spark.mllib`._
>
> h2. Algorithms and performance
>
> * iteratively reweighted least squares (IRLS) for GLMs (SPARK-9835)
> * estimator interface for GLMs (SPARK-12811)
> * extended support for GLM model families and link functions in SparkR
> (SPARK-12566)
> * improved model summaries and stats via IRLS (SPARK-9837)
>
> Additional (maybe lower priority):
> * robust linear regression with Huber loss (SPARK-3181)
> * vector-free L-BFGS (SPARK-10078)
> * tree partitioning by features (SPARK-3717)
> * local linear algebra (SPARK-6442)
> * weighted instance support (SPARK-9610)
> ** random forest (SPARK-9478)
> ** GBT (SPARK-9612)
> * locality-sensitive hashing (LSH) (SPARK-5992)
> * deep learning (SPARK-5575)
> ** autoencoder (SPARK-10408)
> ** restricted Boltzmann machine (RBM) (SPARK-4251)
> ** convolutional neural network (stretch)
> * factorization machines (SPARK-7008)
> * distributed LU decomposition (SPARK-8514)
>
> h2. Statistics
>
> * bivariate statistics as UDAFs (SPARK-10385)
> * R-like statistics for GLMs (SPARK-9835)
> * sketch algorithms (cross-listed): approximate quantiles (SPARK-6761),
> count-min sketch (SPARK-6763), Bloom filter (SPARK-12818)
>
> h2. Pipeline API
>
> * pipeline persistence (SPARK-6725)
> ** trees (SPARK-11888)
> ** RFormula (SPARK-11891)
> ** MLC (SPARK-11871)
> ** PySpark (SPARK-11939) --> *This is now ready for people to take up
> subtasks!*
> * ML attribute API improvements (SPARK-8515)
> * predict single instance (SPARK-10413)
> * test Kaggle datasets (SPARK-9941)
>
> _There may be other design improvement efforts for Pipelines, to be listed
> here soon. See (SPARK-5874) for a list of possibilities._
>
> h2. Model persistence
>
> * PMML export
> ** naive Bayes (SPARK-8546)
> ** decision tree (SPARK-8542)
> * model save/load
> ** FPGrowth (SPARK-6724)
> ** PrefixSpan (SPARK-10386)
> * code generation
> ** decision tree and tree ensembles (SPARK-10387)
>
> h2. Data sources
>
> * public dataset loader (SPARK-10388)
>
> h2. Python API for ML
>
> The main goal of the Python API is feature parity with the Scala/Java API.
> You can find a complete list
> [here|https://issues.apache.org/jira/issues/?filter=12333214]. The tasks
> fall into the following categories:
> * pipeline persistence in PySpark (SPARK-11939)
> * Python API for missing methods (SPARK-11937)
> * Python API for new algorithms. Committers should create a JIRA for the
> Python API after merging a public feature in Scala/Java.
>
> h2. SparkR API for ML
>
> * support more families and link functions in SparkR::glm (SPARK-12566)
> * model summary with R-like statistics for GLMs (SPARK-9837)
> * support more algorithms (k-means (SPARK-13011), survival analysis
> (SPARK-13010), etc.)
>
> h2. Documentation
>
> * reorganize the user guide (SPARK-8517)
> * make example code in the user guide testable (SPARK-11337)
> * @Since versions in spark.ml, pyspark.mllib, and pyspark.ml (SPARK-7751)
> * fix param format in pydoc (SPARK-11219)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
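To make the IRLS item (SPARK-9835) concrete: IRLS fits a GLM by repeatedly solving a weighted least-squares (Newton) system built from the current predicted means. A minimal sketch in plain Python for logistic regression follows; the data and function name are illustrative, not the actual `spark.ml` implementation, which operates on distributed datasets.

```python
# Toy IRLS (Newton's method) for logistic regression: fit
# p(y=1|x) = sigmoid(b0 + b1*x) on a tiny, non-separable dataset.
import math

def irls_logistic(xs, ys, iters=25):
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Accumulate the gradient g and the 2x2 Hessian H of the
        # log-likelihood; w = mu*(1-mu) is the IRLS observation weight.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # predicted mean
            w = mu * (1.0 - mu)
            g0 += y - mu
            g1 += (y - mu) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the weighted least-squares system H * delta = g by hand
        # (Cramer's rule for the 2x2 case) and take the Newton step.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [0,   0,   1,   0,   1,   1,   0,   1]
b0, b1 = irls_logistic(xs, ys)
```

Each iteration is exactly a weighted least-squares solve, which is why the improved GLM summaries item (SPARK-9837) falls out of the same machinery: standard errors come from the final Hessian.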
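Likewise for the sketch algorithms item: a count-min sketch (SPARK-6763) answers approximate frequency queries in small, fixed space, never undercounting but possibly overcounting on hash collisions. A small illustrative version in plain Python; the hashing scheme and table sizes here are arbitrary choices for the sketch, not Spark's.

```python
# Illustrative count-min sketch: depth rows of width counters, one hash
# function per row; estimate = min over rows, an upper bound on the true count.
import hashlib

class CountMinSketch:
    def __init__(self, depth=5, width=272):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # Derive one hash per row by salting a stable digest with the row index.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # Never undercounts; may overcount when other items collide.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))
```

Merging two sketches is cell-wise addition of their tables, which is what makes the structure a natural fit for a distributed aggregation.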
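The Bloom filter item (SPARK-12818) is the membership-query counterpart: no false negatives, tunable false-positive rate. A compact illustrative version in plain Python, with bit-array size and hash scheme chosen arbitrarily for the sketch rather than taken from Spark.

```python
# Illustrative Bloom filter: k salted hashes set/check k bit positions.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def put(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True for every inserted item (no false negatives);
        # may occasionally be true for items never inserted.
        return all(self.bits[pos] for pos in self._positions(item))
```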