[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Ulanov updated SPARK-5575:
------------------------------------
    Description: 
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities, and it is useful to have deep learning 
as one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. After all, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize: Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside in packages or as pluggable external tools. 
These three would provide a comprehensive deep learning set for Spark ML. We 
might also include recurrent networks.

*Requirements:*
# Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
Layer, Error, Regularization, Forward and Backpropagation should be implemented 
as traits or interfaces so that they can be easily extended or reused (a 
minimal trait sketch follows this list). Define the Spark ML API for deep 
learning. This interface is similar to the other analytics tools in Spark and 
supports ML pipelines, which makes deep learning easy to use and easy to plug 
into analytics workloads for Spark users. 
# Efficiency. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Given that, an efficient 
implementation of deep learning in Spark should be only a few times slower than 
a specialized tool. This is very reasonable for a platform that does much more 
than deep learning, and I believe it is understood by the community.
# Scalability. Implement efficient distributed training. This relies heavily on 
efficient communication and scheduling mechanisms. The default implementation 
is based on Spark (a data-parallel gradient sketch also follows this list); 
more efficient implementations might use external libraries but should expose 
the same interface.
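
To make the first requirement concrete, here is a minimal sketch of what the 
trait-based layer abstraction could look like. The names (Layer, LayerModel, 
forward, backward, grad) and the signatures are illustrative assumptions, not 
the final API; Breeze types are used only because MLlib already depends on 
Breeze.

{code:scala}
// Illustrative sketch only: trait and method names are assumptions,
// not the final Spark ML API.
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}

/** A layer type: how many weights it needs and how to bind them into a model. */
trait Layer extends Serializable {
  def weightSize: Int
  def createModel(weights: BDV[Double]): LayerModel
}

/** A layer bound to concrete weights; runs the forward and backward passes. */
trait LayerModel extends Serializable {
  /** Forward propagation: compute this layer's output for an input batch. */
  def forward(input: BDM[Double]): BDM[Double]
  /** Backpropagation: compute the delta to pass to the previous layer. */
  def backward(input: BDM[Double], output: BDM[Double],
               nextDelta: BDM[Double]): BDM[Double]
  /** Gradient of the loss with respect to this layer's weights. */
  def grad(input: BDM[Double], delta: BDM[Double]): BDV[Double]
}
{code}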

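For the scalability requirement, the usual data-parallel scheme in Spark is to 
broadcast the current weights, compute gradients per partition and aggregate 
them with treeAggregate. The sketch below assumes a generic gradientOf function 
and plain Array[Double] weights; it illustrates the communication pattern only 
and is not Spark's actual optimizer code.

{code:scala}
// Sketch of one data-parallel gradient descent step. gradientOf is a
// placeholder for "gradient of the loss for one example"; not real Spark code.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def gradientStep(
    sc: SparkContext,
    data: RDD[(Array[Double], Array[Double])],   // (features, label) pairs
    weights: Array[Double],
    gradientOf: (Array[Double], (Array[Double], Array[Double])) => Array[Double],
    learningRate: Double): Array[Double] = {
  val bcWeights = sc.broadcast(weights)          // ship weights once per executor
  val n = data.count().toDouble
  // Sum per-example gradients with a tree pattern to limit traffic to the driver.
  val gradSum = data.treeAggregate(new Array[Double](weights.length))(
    seqOp = (acc, example) => {
      val g = gradientOf(bcWeights.value, example)
      var i = 0
      while (i < acc.length) { acc(i) += g(i); i += 1 }
      acc
    },
    combOp = (a, b) => {
      var i = 0
      while (i < a.length) { a(i) += b(i); i += 1 }
      a
    })
  bcWeights.unpersist()
  weights.zip(gradSum).map { case (w, g) => w - learningRate * g / n }
}
{code}
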
*Main features:* 
# Multilayer perceptron (a pipeline usage sketch follows this list).
# Autoencoder.
# Convolutional neural networks. The interface has to provide a few deep 
learning architectures that are widely used in practice, such as AlexNet.
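
To show how such features plug into ML pipelines, the multilayer perceptron 
that already exists in Spark ML can be used as below. The layer sizes, the 
sample data path (shipped with the Spark distribution) and the use of the 
spark-shell SparkSession are illustrative assumptions; the "accuracy" metric 
name follows the Spark 2.x evaluator.

{code:scala}
// Usage sketch for the existing multilayer perceptron classifier.
// Assumes a spark-shell session where `spark` is the SparkSession.
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val data = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 1234L)

// Layers: 4 input features, two hidden layers, 3 output classes.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 16, 8, 3))
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = mlp.fit(train)            // behaves like any other ML Estimator
val predictions = model.transform(test)

val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"Test accuracy = $accuracy")
{code}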

*Additional features:* (lower priority)
# Text configuration of architectures. The internal API of Spark ANN is 
designed to be flexible and can handle different types of layers. However, only 
a part of the API is made public. We have to limit the number of public classes 
in order to make it simpler to support other languages, which forces us to use 
String or numeric parameters instead of introducing new public classes. One 
option for specifying an ANN architecture is a text configuration with a 
layer-wise description (a hypothetical sketch follows this list). We have 
considered using the Caffe format for this. It gives the benefit of 
compatibility with a well-known deep learning tool and simplifies supporting 
other languages in Spark. Implementing a parser for a subset of the Caffe 
format might be the first step towards supporting general ANN architectures in 
Spark. 
# Hardware-specific optimization. One can wrap other deep learning 
implementations behind this interface, allowing users to pick a particular 
back-end, e.g. Caffe or TensorFlow, along with the default one (a back-end 
trait sketch also follows this list). The main motivation for using specialized 
libraries for deep learning is to take full advantage of the hardware where 
Spark runs, in particular GPUs. Having the default interface in Spark, we would 
need to wrap only a subset of functions from a given specialized library. This 
does require effort; however, it is not the same as wrapping all functions. 
Wrappers can be provided as packages without pulling new dependencies into 
Spark.
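
As an illustration of the layer-wise text configuration idea, a hypothetical 
(non-Caffe) format and a tiny parser could look like the sketch below. The 
format, the layer names and the LayerSpec class are assumptions for 
illustration, not a proposal of the final syntax.

{code:scala}
// Hypothetical layer-wise text configuration and a minimal parser.
// The format and the LayerSpec case class are illustrative assumptions only.
case class LayerSpec(kind: String, params: Map[String, String])

val config =
  """affine  in=784 out=300
    |sigmoid
    |affine  in=300 out=10
    |softmax""".stripMargin

def parse(text: String): Seq[LayerSpec] =
  text.split("\n").map(_.trim).filter(_.nonEmpty).toSeq.map { line =>
    val tokens = line.split("\\s+").toList
    val params = tokens.tail.map { token =>
      val Array(key, value) = token.split("=", 2)
      key -> value
    }.toMap
    LayerSpec(tokens.head, params)
  }

// parse(config) yields LayerSpec("affine", Map("in" -> "784", "out" -> "300")),
// LayerSpec("sigmoid", Map()), and so on.
{code}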

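For the back-end idea, the wrapper could be as small as a trait that maps a 
parsed architecture and the training data to fitted weights; the BackendTrainer 
name and its signature are assumptions for illustration (it reuses the 
LayerSpec class from the previous sketch).

{code:scala}
// Hypothetical back-end abstraction; the trait name and signature are
// illustrative assumptions, not an agreed-upon API.
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

trait BackendTrainer extends Serializable {
  /** Train the network described by `layers` on (features, target) pairs
   *  and return the learned weights as a flat array. */
  def train(layers: Seq[LayerSpec],
            data: RDD[(Vector, Vector)],
            maxIter: Int): Array[Double]
}

// The default implementation would run on Spark itself; alternatives could
// delegate to Caffe or TensorFlow via JNI or external processes while keeping
// this same interface, so they can be shipped as separate packages.
{code}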



*Detailed requirements:* 
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and 
Backpropagation should be implemented as traits or interfaces, so they can be 
easily extended or reused.
2) Implement complex abstractions, such as feed-forward and recurrent networks.
3) Implement the multilayer perceptron (MLP), convolutional networks (LeNet), 
autoencoders (sparse and denoising), stacked autoencoders, restricted Boltzmann 
machines (RBM), deep belief networks (DBN), etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers 
and poolers (a standalone pooling sketch follows this list).
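
As a flavor of the supporting constructs, a pooler can be as simple as the 
standalone sketch below: a 2x2 max-pooling pass over a row-major image. The 
function name, shapes and layout are illustrative, not an API proposal.

{code:scala}
// Standalone sketch of one supporting construct: 2x2 max pooling over a
// row-major image stored as Array[Double]. Illustrative only.
def maxPool2x2(image: Array[Double], rows: Int, cols: Int): Array[Double] = {
  require(rows % 2 == 0 && cols % 2 == 0, "sketch assumes even dimensions")
  val out = new Array[Double]((rows / 2) * (cols / 2))
  var r = 0
  while (r < rows) {
    var c = 0
    while (c < cols) {
      val m = math.max(
        math.max(image(r * cols + c), image(r * cols + c + 1)),
        math.max(image((r + 1) * cols + c), image((r + 1) * cols + c + 1)))
      out((r / 2) * (cols / 2) + c / 2) = m
      c += 2
    }
    r += 2
  }
  out
}

// Example: maxPool2x2(Array[Double](1, 2, 3, 4, 5, 6, 7, 8), rows = 2, cols = 4)
// returns Array(6.0, 8.0).
{code}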

  was:
*Goal:* Implement various types of artificial neural networks

*Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
Having deep learning within Spark's ML library is a question of convenience. 
Spark has broad analytic capabilities, and it is useful to have deep learning 
as one of these tools at hand. Deep learning is a model of choice for several 
important modern use cases, and Spark ML might want to cover them. After all, 
it is hard to explain why we have PCA in ML but do not provide an autoencoder. 
To summarize: Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network and the autoencoder. Advanced and experimental deep 
learning features might reside in packages or as pluggable external tools. 
These three would provide a comprehensive deep learning set for Spark ML. We 
might also include recurrent networks.

*Requirements:*
# Implement an extensible API compatible with Spark ML. Basic abstractions such 
as Neuron, Layer, Error, Regularization, Forward and Backpropagation should be 
implemented as traits or interfaces, so they can be easily extended or 
reused. 
# Performance. The current implementation of the multilayer perceptron in Spark 
is less than 2x slower than Caffe, both measured on CPU. The main overhead 
sources are the JVM and Spark's communication layer. For more details, please 
refer to https://github.com/avulanov/ann-benchmark. Given that, an efficient 
implementation of deep learning in Spark should be only a few times slower than 
a specialized tool. This is very reasonable for a platform that does much more 
than deep learning, and I believe it is understood by the community.

# Implement efficient distributed training. This relies heavily on efficient 
communication and scheduling mechanisms. The default implementation is based on 
Spark; more efficient implementations might use external libraries but should 
expose the same interface.

The additional benefit of implementing deep learning for Spark is that we 
define the Spark ML API for deep learning. This interface is similar to the 
other analytics tools in Spark and supports ML pipelines, which makes deep 
learning easy to use and easy to plug into analytics workloads for Spark users. 

One can wrap other deep learning implementations behind this interface, 
allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along 
with the default one. The interface has to provide a few deep learning 
architectures that are widely used in practice, such as AlexNet. The main 
motivation for using specialized libraries for deep learning is to take full 
advantage of the hardware where Spark runs, in particular GPUs. Having the 
default interface in Spark, we would need to wrap only a subset of functions 
from a given specialized library. This does require effort; however, it is not 
the same as wrapping all functions. Wrappers can be provided as packages 
without pulling new dependencies into Spark.




*Detailed requirements:* 
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and 
Backpropagation should be implemented as traits or interfaces, so they can be 
easily extended or reused.
2) Implement complex abstractions, such as feed-forward and recurrent networks.
3) Implement the multilayer perceptron (MLP), convolutional networks (LeNet), 
autoencoders (sparse and denoising), stacked autoencoders, restricted Boltzmann 
machines (RBM), deep belief networks (DBN), etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers 
and poolers.


> Artificial neural networks for MLlib deep learning
> --------------------------------------------------
>
>                 Key: SPARK-5575
>                 URL: https://issues.apache.org/jira/browse/SPARK-5575
>             Project: Spark
>          Issue Type: Umbrella
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Alexander Ulanov
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
