[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap

Alexander Ulanov (JIRA) Thu, 16 Jun 2016 18:19:19 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325377#comment-15325377
 ]


Alexander Ulanov edited comment on SPARK-15581 at 6/17/16 1:18 AM:
-------------------------------------------------------------------

I would like to comment on the Breeze and deep learning parts, because I have
been implementing a multilayer perceptron for Spark and have used Breeze a lot.

Breeze provides convenient abstractions for dense and sparse vectors and
matrices and supports linear algebra backed by netlib-java and native BLAS. At
the same time, Spark's "linalg" has its own abstractions for the same things,
which might confuse users and developers. Ideally, Spark should have a single
library for linear algebra. That said, Breeze is more convenient and flexible
than "linalg", though it lacks some features, such as in-place matrix
multiplication and multidimensional arrays. Breeze cannot be removed from
Spark because "linalg" does not have enough functionality to fully replace it.
To address this, I have implemented a Scala tensor library on top of
netlib-java. "linalg" can be wrapped around it. It also provides functions
similar to Breeze's and supports multidimensional arrays. [~mengxr],
[~dbtsai] and I were planning to discuss this after the 2.0 release, and I am
posting these considerations here since you raised this question too. Could
you take a look at this library and tell me what you think? The source code is
here: https://github.com/avulanov/scala-tensor
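
To make the two missing features named above concrete (multidimensional
arrays and in-place matrix multiplication), here is a minimal sketch in plain
Python. It is not the API of scala-tensor, Breeze, or "linalg". The names
Tensor and gemm_inplace are hypothetical, and row-major storage is an
assumption made for readability (BLAS itself is column-major). The update
follows the usual BLAS gemm convention C := alpha * A * B + beta * C.

```python
from typing import List, Optional, Sequence


class Tensor:
    """Hypothetical dense N-dimensional array over one flat, row-major list."""

    def __init__(self, shape: Sequence[int],
                 data: Optional[List[float]] = None):
        self.shape = tuple(shape)
        size = 1
        for dim in self.shape:
            size *= dim
        self.data = [0.0] * size if data is None else list(data)
        if len(self.data) != size:
            raise ValueError("data length does not match shape")
        # Row-major strides: the last index varies fastest.
        self.strides = []
        step = 1
        for dim in reversed(self.shape):
            self.strides.insert(0, step)
            step *= dim

    def _offset(self, index: Sequence[int]) -> int:
        # Map an N-dimensional index to a position in the flat buffer.
        return sum(i * s for i, s in zip(index, self.strides))

    def __getitem__(self, index):
        return self.data[self._offset(index)]

    def __setitem__(self, index, value):
        self.data[self._offset(index)] = value


def gemm_inplace(alpha: float, a: Tensor, b: Tensor,
                 beta: float, c: Tensor) -> None:
    """C := alpha * A * B + beta * C, written into C's existing buffer."""
    m, k = a.shape
    k2, n = b.shape
    if k != k2 or c.shape != (m, n):
        raise ValueError("incompatible shapes")
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i, p] * b[p, j]
            # No new result matrix is allocated; C is overwritten.
            c[i, j] = alpha * acc + beta * c[i, j]
```

The point of the in-place form is that a training loop can reuse the same
output buffer on every iteration instead of allocating a new matrix per
multiplication.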

With regard to deep learning, I believe that having deep learning within
Spark's ML library is a question of convenience. Spark has broad analytic
capabilities, and it is useful to have deep learning at hand as one of these
tools. Deep learning is the model of choice for several important modern use
cases, and Spark ML might want to cover them. After all, it is hard to explain
why we have PCA in ML but do not provide an autoencoder. To summarize, I think
that Spark should have at least the most widely used deep learning models,
such as the fully connected artificial neural network, the convolutional
network and the autoencoder. Advanced and experimental deep learning features
might reside in packages or in pluggable external tools. Spark ML already has
fully connected networks in place. The stacked autoencoder is implemented but
not merged yet. The only thing that remains is the convolutional network.
These three will provide a comprehensive deep learning set for Spark ML. We
might also include recurrent networks.

An additional benefit of implementing deep learning for Spark is that we
define the Spark ML API for deep learning. This interface is similar to the
other analytics tools in Spark and supports ML pipelines. This makes deep
learning easy to use and to plug into analytics workloads for Spark users. One
can wrap other deep learning implementations with this interface, allowing
users to pick a particular back-end, e.g. Caffe or TensorFlow, alongside the
default one. The interface has to provide a few deep learning architectures
that are widely used in practice, such as AlexNet.
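
The pluggable back-end idea described above could look roughly like the
following sketch in plain Python. Every name here (Backend, DefaultBackend,
NetworkEstimator, NetworkModel) is hypothetical and none of it is the actual
Spark ML API. The fit/transform split only mirrors the estimator and model
pattern of ML pipelines, and the default back-end is a trivial stand-in that
predicts the mean label rather than training a network.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Optional, Sequence, Tuple

Row = Sequence[float]


class Backend(ABC):
    """Hypothetical back-end contract: train on rows, return a predictor."""

    @abstractmethod
    def train(self, features: List[Row], labels: List[float],
              layers: Sequence[int]) -> Callable[[Row], float]:
        ...


class DefaultBackend(Backend):
    """Stand-in for a built-in implementation; it ignores `layers` and
    simply predicts the mean label (an assumption to keep the sketch short).
    """

    def train(self, features, labels, layers):
        mean = sum(labels) / len(labels)
        return lambda row: mean


class NetworkModel:
    """Fitted model: pairs each input row with its prediction."""

    def __init__(self, predict: Callable[[Row], float]):
        self._predict = predict

    def transform(self, features: List[Row]) -> List[Tuple[Row, float]]:
        return [(row, self._predict(row)) for row in features]


class NetworkEstimator:
    """Estimator whose back-end is an ordinary, swappable parameter."""

    def __init__(self, layers: Sequence[int],
                 backend: Optional[Backend] = None):
        self.layers = list(layers)
        self.backend = backend if backend is not None else DefaultBackend()

    def fit(self, features: List[Row], labels: List[float]) -> NetworkModel:
        predict = self.backend.train(features, labels, self.layers)
        return NetworkModel(predict)
```

A wrapper around an external library would subclass Backend and be passed as
the backend argument, while user code that calls fit and transform stays
unchanged.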

The ultimate goal will be to provide efficient distributed training, which
relies heavily on efficient communication and scheduling mechanisms. The
default implementation is based on Spark. More efficient implementations might
use external libraries but expose the same interface.
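
As a rough illustration of why communication matters for distributed
training, the sketch below simulates data-parallel gradient descent in plain
Python. It is an assumption about the general technique, not a description of
the Spark implementation: the partitions are ordinary lists, the model is a
one-parameter least-squares fit y ~ w * x, and the function names are
hypothetical. Each step maps over the partitions and then reduces small
partial sums, which is the only data that would cross the network.

```python
from functools import reduce
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y)


def partition_gradient(w: float, part: List[Point]) -> Tuple[float, int]:
    """Sum of squared-error gradients for y ~ w * x over one partition."""
    grad = sum(2.0 * (w * x - y) * x for x, y in part)
    return grad, len(part)


def combine(a: Tuple[float, int], b: Tuple[float, int]) -> Tuple[float, int]:
    """Merge two partial results by adding gradient sums and counts."""
    return a[0] + b[0], a[1] + b[1]


def train(partitions: List[List[Point]], steps: int = 100,
          learning_rate: float = 0.05) -> float:
    w = 0.0
    for _ in range(steps):
        # "Map": each partition computes its gradient independently.
        partials = [partition_gradient(w, part) for part in partitions]
        # "Reduce": only one (sum, count) pair per partition is combined.
        grad_sum, count = reduce(combine, partials)
        # Update with the gradient averaged over all points.
        w -= learning_rate * grad_sum / count
    return w
```

Because only a gradient sum and a count leave each partition, the cost of a
step is dominated by how quickly those partial results can be aggregated and
the next step scheduled.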



> MLlib 2.1 Roadmap
> -----------------
>
>                 Key: SPARK-15581
>                 URL: https://issues.apache.org/jira/browse/SPARK-15581
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Blocker
>              Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, learning the development 
> process while working on a big feature usually causes long delays in code 
> review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify 
> persistence) (SPARK-15574)
> * Feature parity: The main goal of the Python API is to have feature parity 
> with the Scala/Java API. You can find a [complete list here| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)%20AND%20"Target%20Version%2Fs"%20%3D%202.1.0%20ORDER%20BY%20priority%20DESC].
>  The tasks fall into two major categories:
> ** Python API for missing methods (SPARK-14813)
> ** Python API for new algorithms. Committers should create a JIRA for the 
> Python API after merging a public feature in Scala/Java.
> h2. SparkR
> * Improve R formula support and implementation (SPARK-15540)
> * Various SparkR ML API and usability improvements
> ** Note: No linked JIRA yet, but need to create an umbrella once more issues 
> are collected.
> * Wrap more MLlib algorithms
> ** GSoC project [SPARK-15069]
> * Release SparkR on CRAN [SPARK-15799]
> h2. Pipeline API
> * Usability: Automatic feature preprocessing [SPARK-11106]
> * ML attribute API improvements (SPARK-8515)
> * test Kaggle datasets (SPARK-9941)
> * See (SPARK-5874) for a list of other possibilities
> h2. Algorithms and performance
> * Trees & ensembles scaling & speed (SPARK-14045), (SPARK-14046), 
> (SPARK-14047)
> * Locality sensitive hashing (LSH) (SPARK-5992)
> * Similarity search / nearest neighbors (SPARK-2336)
> Additional (may be lower priority):
> * robust linear regression with Huber loss (SPARK-3181)
> * vector-free L-BFGS (SPARK-10078)
> * tree partition by features (SPARK-3717)
> * local linear algebra (SPARK-6442)
> * weighted instance support (SPARK-9610)
> ** random forest (SPARK-9478)
> ** GBT (SPARK-9612)
> * deep learning (SPARK-5575)
> ** autoencoder (SPARK-10408)
> ** restricted Boltzmann machine (RBM) (SPARK-4251)
> ** convolutional neural network (stretch)
> * factorization machine (SPARK-7008)
> * distributed LU decomposition (SPARK-8514)
> h2. Other
> * Infra
> ** Testing for example code (SPARK-12347)
> ** Remove breeze from dependencies (SPARK-15575)
> * public dataset loader (SPARK-10388)
> * Documentation: improve organization of user guide (SPARK-8517)



