[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536321#comment-15536321
 ] 

Alexander Ulanov commented on SPARK-5575:
-----------------------------------------

I recently released a package to handle new features that are not yet merged in 
Spark: https://spark-packages.org/package/avulanov/scalable-deeplearning

> Artificial neural networks for MLlib deep learning
> --------------------------------------------------
>
>                 Key: SPARK-5575
>                 URL: https://issues.apache.org/jira/browse/SPARK-5575
>             Project: Spark
>          Issue Type: Umbrella
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Alexander Ulanov
>
> *Goal:* Implement various types of artificial neural networks
> *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
> Having deep learning within Spark's ML library is a question of convenience. 
> Spark has broad analytic capabilities and it is useful to have deep learning 
> as one of these tools at hand. Deep learning is a model of choice for several 
> important modern use-cases, and Spark ML might want to cover them. 
> Eventually, it is hard to explain, why do we have PCA in ML but don't provide 
> Autoencoder. To summarize this, Spark should have at least the most widely 
> used deep learning models, such as fully connected artificial neural network, 
> convolutional network and autoencoder. Advanced and experimental deep 
> learning features might reside within packages or as pluggable external 
> tools. These 3 will provide a comprehensive deep learning set for Spark ML. 
> We might also include recurrent networks as well.
> *Requirements:*
> # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
> Layer, Error, Regularization, Forward and Backpropagation etc. should be 
> implemented as traits or interfaces, so they can be easily extended or 
> reused. Define the Spark ML API for deep learning. This interface is similar 
> to the other analytics tools in Spark and supports ML pipelines. This makes 
> deep learning easy to use and plug in into analytics workloads for Spark 
> users. 
> # Efficiency. The current implementation of multilayer perceptron in Spark is 
> less than 2x slower than Caffe, both measured on CPU. The main overhead 
> sources are JVM and Spark's communication layer. For more details, please 
> refer to https://github.com/avulanov/ann-benchmark. Having said that, the 
> efficient implementation of deep learning in Spark should be only few times 
> slower than in specialized tool. This is very reasonable for the platform 
> that does much more than deep learning and I believe it is understood by the 
> community.
> # Scalability. Implement efficient distributed training. It relies heavily on 
> the efficient communication and scheduling mechanisms. The default 
> implementation is based on Spark. More efficient implementations might 
> include some external libraries but use the same interface defined.
> *Main features:* 
> # Multilayer perceptron classifier (MLP)
> # Autoencoder
> # Convolutional neural networks for computer vision. The interface has to 
> provide few architectures for deep learning that are widely used in practice, 
> such as AlexNet
> *Additional features:*
> # Other architectures, such as Recurrent neural network (RNN), Long-short 
> term memory (LSTM), Restricted boltzmann machine (RBM), deep belief network 
> (DBN), MLP multivariate regression
> # Regularizers, such as L1, L2, drop-out
> # Normalizers
> # Network customization. The internal API of Spark ANN is designed to be 
> flexible and can handle different types of layers. However, only a part of 
> the API is made public. We have to limit the number of public classes in 
> order to make it simpler to support other languages. This forces us to use 
> (String or Number) parameters instead of introducing of new public classes. 
> One of the options to specify the architecture of ANN is to use text 
> configuration with layer-wise description. We have considered using Caffe 
> format for this. It gives the benefit of compatibility with well known deep 
> learning tool and simplifies the support of other languages in Spark. 
> Implementation of a parser for the subset of Caffe format might be the first 
> step towards the support of general ANN architectures in Spark. 
> # Hardware specific optimization. One can wrap other deep learning 
> implementations with this interface allowing users to pick a particular 
> back-end, e.g. Caffe or TensorFlow, along with the default one. The interface 
> has to provide few architectures for deep learning that are widely used in 
> practice, such as AlexNet. The main motivation for using specialized 
> libraries for deep learning would be to fully take advantage of the hardware 
> where Spark runs, in particular GPUs. Having the default interface in Spark, 
> we will need to wrap only a subset of functions from a given specialized 
> library. It does require an effort, however it is not the same as wrapping 
> all functions. Wrappers can be provided as packages without the need to pull 
> new dependencies into Spark.
> *Completed (merged to the main Spark branch):*
> * Requirements: https://issues.apache.org/jira/browse/SPARK-9471
> ** API 
> https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/
> ** Efficiency & Scalability: https://github.com/avulanov/ann-benchmark
> * Features:
> ** Multilayer perceptron classifier 
> https://issues.apache.org/jira/browse/SPARK-9471
> *In progress (pull request):*
> * Features:
> ** Autoencoder https://issues.apache.org/jira/browse/SPARK-2623
> * Additional features:
> ** MLP regression: https://issues.apache.org/jira/browse/SPARK-10409
> *Scalable deep learning package:* 
> * This package is intended for new Spark deep learning features that were not 
> yet merged to Spark ML or that are too specific to be merged: 
> https://spark-packages.org/package/avulanov/scalable-deeplearning



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to