[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536321#comment-15536321 ]
Alexander Ulanov commented on SPARK-5575:
-----------------------------------------

I recently released a package that provides new deep learning features that are not yet merged into Spark: https://spark-packages.org/package/avulanov/scalable-deeplearning

> Artificial neural networks for MLlib deep learning
> --------------------------------------------------
>
>                 Key: SPARK-5575
>                 URL: https://issues.apache.org/jira/browse/SPARK-5575
>             Project: Spark
>          Issue Type: Umbrella
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Alexander Ulanov
>
> *Goal:* Implement various types of artificial neural networks
>
> *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
> Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities, and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use cases, and Spark ML might want to cover them. After all, it is hard to explain why we have PCA in ML but do not provide an autoencoder. To summarize: Spark should have at least the most widely used deep learning models, such as the fully connected artificial neural network, the convolutional network, and the autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. These three will provide a comprehensive deep learning set for Spark ML. We might include recurrent networks as well.
>
> *Requirements:*
> # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so that they can be easily extended or reused. Define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines. This makes deep learning easy to use and to plug into analytics workloads for Spark users.
> # Efficiency. The current implementation of the multilayer perceptron in Spark is less than 2x slower than Caffe, both measured on CPU. The main overhead sources are the JVM and Spark's communication layer. For more details, please refer to https://github.com/avulanov/ann-benchmark. Having said that, an efficient implementation of deep learning in Spark should be only a few times slower than a specialized tool. This is very reasonable for a platform that does much more than deep learning, and I believe this is understood by the community.
> # Scalability. Implement efficient distributed training. This relies heavily on efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might include external libraries but should use the same interface as defined here.
>
> *Main features:*
> # Multilayer perceptron classifier (MLP); a usage sketch follows this list
> # Autoencoder
> # Convolutional neural networks for computer vision. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet.
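For reference, the multilayer perceptron classifier from the first item above is already part of Spark ML (merged under SPARK-9471). A minimal usage sketch, assuming Spark 2.x and the sample LibSVM file shipped with the Spark distribution (4 features, 3 classes):

{code:scala}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("MLPExample").getOrCreate()

// Load a LibSVM file into a DataFrame with "label" and "features" columns.
val data = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 1234L)

// Layer sizes: 4 inputs, two hidden layers of 5 and 4 units, 3 output classes.
val layers = Array[Int](4, 5, 4, 3)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = trainer.fit(train)
val result = model.transform(test)

val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")
println(s"Test set accuracy = ${evaluator.evaluate(result)}")
{code}

The estimator/model split mirrors the rest of Spark ML, so the classifier drops into ML pipelines like any other stage.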
> *Additional features:*
> # Other architectures, such as the recurrent neural network (RNN), long short-term memory (LSTM), the restricted Boltzmann machine (RBM), the deep belief network (DBN), and MLP multivariate regression
> # Regularizers, such as L1, L2, and dropout
> # Normalizers
> # Network customization. The internal API of Spark ANN is designed to be flexible and can handle different types of layers. However, only a part of the API is made public. We have to limit the number of public classes in order to make it simpler to support other languages. This forces us to use (String or Number) parameters instead of introducing new public classes. One option for specifying the architecture of an ANN is a text configuration with a layer-wise description. We have considered using the Caffe format for this. It gives the benefit of compatibility with a well-known deep learning tool and simplifies the support of other languages in Spark. Implementing a parser for a subset of the Caffe format might be the first step towards the support of general ANN architectures in Spark.
> # Hardware-specific optimization. One can wrap other deep learning implementations with this interface, allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the default one; see the sketch at the end of this message. The interface has to provide a few architectures for deep learning that are widely used in practice, such as AlexNet. The main motivation for using specialized libraries for deep learning is to take full advantage of the hardware where Spark runs, in particular GPUs. With the default interface in Spark, we would only need to wrap a subset of the functions of a given specialized library. This does require effort, but it is not the same as wrapping all functions. Wrappers can be provided as packages without pulling new dependencies into Spark.
>
> *Completed (merged to the main Spark branch):*
> * Requirements: https://issues.apache.org/jira/browse/SPARK-9471
> ** API: https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/
> ** Efficiency & scalability: https://github.com/avulanov/ann-benchmark
> * Features:
> ** Multilayer perceptron classifier: https://issues.apache.org/jira/browse/SPARK-9471
>
> *In progress (pull request):*
> * Features:
> ** Autoencoder: https://issues.apache.org/jira/browse/SPARK-2623
> * Additional features:
> ** MLP regression: https://issues.apache.org/jira/browse/SPARK-10409
>
> *Scalable deep learning package:*
> * This package is intended for new Spark deep learning features that have not yet been merged into Spark ML or that are too specific to be merged: https://spark-packages.org/package/avulanov/scalable-deeplearning
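As a rough illustration of the back-end idea from the "Hardware-specific optimization" item, here is a minimal sketch. None of these types exist in Spark or in the package above; the names (ANNBackend, ANNModelHandle, CaffeBackend) are hypothetical and only show how a small subset of a specialized library could sit behind one Spark-facing interface:

{code:scala}
import org.apache.spark.ml.linalg.Vector

// Hypothetical contract for a pluggable ANN back-end: train on a layer-wise
// architecture description and labelled vectors, return a handle for scoring.
trait ANNBackend {
  def train(layerSizes: Array[Int], data: Iterator[(Vector, Vector)]): ANNModelHandle
}

trait ANNModelHandle {
  def predict(features: Vector): Vector
}

// The default JVM implementation would live in Spark itself ...
class DefaultBackend extends ANNBackend {
  def train(layerSizes: Array[Int], data: Iterator[(Vector, Vector)]): ANNModelHandle =
    ??? // delegate to the existing Spark ANN implementation
}

// ... while GPU-oriented wrappers (e.g. around Caffe or TensorFlow) would ship
// as Spark packages, so no new dependencies are pulled into Spark itself.
class CaffeBackend(configPath: String) extends ANNBackend {
  def train(layerSizes: Array[Int], data: Iterator[(Vector, Vector)]): ANNModelHandle =
    ??? // call into the native library, e.g. via JNI or JavaCPP
}
{code}

Only the train/predict subset needs wrapping, which is the point made above: wrapping a handful of functions is far less effort than wrapping an entire library.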