[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368318#comment-15368318 ]
Nick Pentreath commented on SPARK-15581:
----------------------------------------

I think it would be pretty interesting to explore a (probably fairly experimental) mechanism to train on structured streams/DFs and sink to a "prediction" stream and/or some "model state store". A rough sketch of what the scoring half might look like is interleaved below.

> MLlib 2.1 Roadmap
> -----------------
>
>                 Key: SPARK-15581
>                 URL: https://issues.apache.org/jira/browse/SPARK-15581
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Blocker
>              Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next release. Please view this as a wish list rather than a definite plan, as we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items for the list in the comments. We are experimenting with this process. Your feedback would be greatly appreciated.
>
> h1. Instructions
>
> h2. For contributors:
>
> * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when you start working on a feature. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released to other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps to improve others' code as well as yours.
>
> h2. For committers:
>
> * Try to break down big features into small and specific JIRA tasks and link them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and documentation if applicable.
>
> h1. Roadmap (*WIP*)
>
> This is NOT [a complete list of MLlib JIRAs for 2.1|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. We only include umbrella JIRAs and high-level tasks.
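To make the comment above concrete, here is a minimal sketch of the scoring half only: a PipelineModel fitted offline is applied to a structured stream and the results are sunk to a "prediction stream". This is not an existing API design; the paths, input schema, and Parquet sink are hypothetical stand-ins, and the harder part of the idea (training/updating the model on the stream and publishing it to a "model state store") is deliberately left out.

{code:scala}
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

object StreamingScoringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-scoring-sketch").getOrCreate()

    // Hypothetical: a pipeline fitted offline on a static DataFrame and saved
    // with model.write.save(...). The path is made up for this sketch.
    val model = PipelineModel.load("/models/my-pipeline")

    // Hypothetical input: JSON records with a single "text" column landing in a directory.
    val inputSchema = new StructType().add("text", StringType)
    val incoming = spark.readStream
      .schema(inputSchema)
      .json("/data/incoming")

    // Score the stream with the pre-fitted model. For models whose transform()
    // is row-wise (no aggregation), this can run on an unbounded DataFrame.
    val predictions = model.transform(incoming)

    // Sink the "prediction stream"; a Parquet directory stands in for whatever
    // the real prediction sink or model state store would be.
    val query = predictions.writeStream
      .format("parquet")
      .option("path", "/data/predictions")
      .option("checkpointLocation", "/data/checkpoints")
      .start()

    query.awaitTermination()
  }
}
{code}

A foreach sink or a key-value store writer could replace the Parquet sink if predictions need to feed another system; the model-state-store side would presumably need something beyond what writeStream offers today.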
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of MLlib. However, we will prioritize API parity, bug fixes, and improvements over new features.
>
> Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for it, but new features, APIs, and improvements will only be added to `spark.ml`.
>
> h2. Critical feature parity in DataFrame-based API
>
> * Umbrella JIRA: [SPARK-4591]
>
> h2. Persistence
>
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
>
> h2. Python API
>
> * Standardize unit tests for Scala and Python to improve and consolidate test coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params and persistence (SPARK-14771) (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete. More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify persistence) (SPARK-15574)
> * Feature parity: The main goal of the Python API is to have feature parity with the Scala/Java API. You can find a [complete list here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)%20AND%20"Target%20Version%2Fs"%20%3D%202.1.0%20ORDER%20BY%20priority%20DESC]. The tasks fall into two major categories:
> ** Python API for missing methods (SPARK-14813)
> ** Python API for new algorithms. Committers should create a JIRA for the Python API after merging a public feature in Scala/Java.
>
> h2. SparkR
>
> * Improve R formula support and implementation (SPARK-15540)
> * Various SparkR ML API and usability improvements
> ** Note: No linked JIRA yet, but we need to create an umbrella once more issues are collected.
> * Wrap more MLlib algorithms (SPARK-16442)
> * Release SparkR on CRAN [SPARK-15799]
>
> h2. Pipeline API
>
> * Usability: Automatic feature preprocessing [SPARK-11106]
> * ML attribute API improvements (SPARK-8515)
> * Test against Kaggle datasets (SPARK-9941)
> * See (SPARK-5874) for a list of other possibilities
>
> h2. Algorithms and performance
>
> * Trees & ensembles scaling & speed (SPARK-14045), (SPARK-14046), (SPARK-14047)
> * Locality sensitive hashing (LSH) (SPARK-5992)
> * Similarity search / nearest neighbors (SPARK-2336)
>
> Additional (may be lower priority):
> * Robust linear regression with Huber loss (SPARK-3181)
> * Vector-free L-BFGS (SPARK-10078)
> * Tree partitioning by features (SPARK-3717)
> * Local linear algebra (SPARK-6442)
> * Weighted instance support (SPARK-9610)
> ** Random forest (SPARK-9478)
> ** GBT (SPARK-9612)
> * Deep learning (SPARK-5575)
> ** Autoencoder (SPARK-10408)
> ** Restricted Boltzmann machine (RBM) (SPARK-4251)
> ** Convolutional neural network (stretch)
> * Factorization machines (SPARK-7008)
> * Distributed LU decomposition (SPARK-8514)
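Since ML persistence is one of the major efforts listed above, here is a minimal sketch of the save/load flow that already exists in `spark.ml` (fit a Pipeline, persist it, reload it). The file path and toy data are made up; the roadmap items (Python tuning persistence, cross-language/R compatibility, backwards-compatibility guarantees) are about extending and hardening this flow, not about this basic usage.

{code:scala}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PersistenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persistence-sketch").getOrCreate()

    // Toy training data: (id, text, label).
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // A small text-classification pipeline.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val model = pipeline.fit(training)

    // Persist the fitted model and reload it; the path is hypothetical.
    model.write.overwrite().save("/tmp/spark-lr-pipeline")
    val reloaded = PipelineModel.load("/tmp/spark-lr-pipeline")

    reloaded.transform(training).select("id", "prediction").show()
    spark.stop()
  }
}
{code}

In principle the same saved directory can be read back from `pyspark.ml` via its `PipelineModel.load`, which is the kind of cross-language guarantee SPARK-15572 and SPARK-15573 aim to pin down.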
> h2. Other
>
> * Infra
> ** Testing for example code (SPARK-12347)
> ** Remove breeze from dependencies (SPARK-15575)
> * Public dataset loader (SPARK-10388)
> * Documentation: improve organization of user guide (SPARK-8517)