The deeplearning4j framework provides a variety of distributed, neural 
network-based learning algorithms, including convolutional nets, deep 
autoencoders, deep-belief nets, and recurrent nets. We’re working on 
integration with the Spark ML pipeline, leveraging the developer API. This 
announcement shares some code and solicits feedback from the Spark community.

The integration code is located in the dl4j-spark-ml module in the 
deeplearning4j repository.

Major aspects of the integration work:
ML algorithms. To bind the dl4j algorithms to the ML pipeline, we developed a 
new classifier and a new unsupervised learning estimator.

ML attributes. We strove to interoperate well with other pipeline components. 
ML attributes are column-level metadata that enable information sharing between 
pipeline components. See here how the classifier reads label metadata from a 
column provided by the new StringIndexer.
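
To illustrate the mechanism (a sketch using the standard Spark ML API, with `df` assumed to be an existing DataFrame holding a string "species" column; the dl4j classifier itself is not shown):

```scala
import org.apache.spark.ml.attribute.Attribute
import org.apache.spark.ml.feature.StringIndexer

// StringIndexer records nominal ML attributes (the distinct label values)
// in its output column's metadata.
val indexer = new StringIndexer()
  .setInputCol("species")
  .setOutputCol("label")
val indexed = indexer.fit(df).transform(df)

// A downstream classifier can recover the label cardinality from that
// column metadata instead of rescanning the data:
val labelAttr = Attribute.fromStructField(indexed.schema("label"))
```
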
Large binary data. It is challenging to work with large binary data in Spark. 
An effective approach is to leverage PrunedScan and to carefully control 
partition sizes. Here we explored this with a custom data source based on the 
new relation API.
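
The shape of such a data source looks roughly like this (a hypothetical relation, not the actual dl4j implementation; with PrunedScan, Spark pushes down the selected column names, so a large binary payload is never materialized for queries that skip it):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
import org.apache.spark.sql.types._

// Illustrative relation over records carrying a large binary payload.
class BinaryDataRelation(@transient val sqlContext: SQLContext)
    extends BaseRelation with PrunedScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("payload", BinaryType, nullable = true)))

  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    // Read only the requested columns, and size the underlying partitions
    // so each holds a bounded number of binary records.
    ???
  }
}
```
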
Column-based record readers. Here we explored how to construct rows from a 
Hadoop input split by composing a number of column-level readers, with pruning 
support.
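
The composition idea can be sketched as follows (entirely illustrative interfaces, not the actual dl4j classes: each reader decodes one field from the input split, and a row is assembled only from the readers for the pruned column set):

```scala
import org.apache.spark.sql.Row

// Hypothetical per-column reader over a Hadoop input split.
trait ColumnReader {
  def name: String
  def readValue(): Any
}

// Build a row from just the columns a query requires.
def readRow(readers: Seq[ColumnReader], requiredColumns: Set[String]): Row =
  Row.fromSeq(readers.filter(r => requiredColumns(r.name)).map(_.readValue()))
```
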
UDTs. With Spark SQL it is possible to introduce new data types. We 
prototyped an experimental Tensor type, here.
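
A user-defined type in the Spark 1.x style has roughly this shape (`Tensor` is a placeholder class here, and the serialization bodies are elided; the actual experimental type lives in the dl4j-spark-ml module):

```scala
import org.apache.spark.sql.types._

// Placeholder user class for illustration.
class Tensor(val values: Array[Double])

// Sketch of a Spark SQL UDT: maps the user class to a catalyst type.
class TensorUDT extends UserDefinedType[Tensor] {
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
  // Convert a Tensor to/from its catalyst representation.
  override def serialize(obj: Any): Any = ???
  override def deserialize(datum: Any): Tensor = ???
  override def userClass: Class[Tensor] = classOf[Tensor]
}
```
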
Spark package. We developed a Spark package to make it easy to use the dl4j 
framework in spark-shell and with spark-submit. See the 
deeplearning4j/dl4j-spark-ml repository for useful snippets involving the 
sbt-spark-package plugin.
Example code. Examples demonstrate how the standardized ML API simplifies 
interoperability, such as with label preprocessing and feature scaling. See 
the deeplearning4j/dl4j-spark-ml-examples repository for an expanding set of 
example pipelines.
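
A typical pipeline of that shape looks like this (an illustrative sketch: `dl4jClassifier` stands in for the new classifier stage, and `df` for a DataFrame of raw features and labels):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StandardScaler, StringIndexer}

// Label preprocessing and feature scaling as standard pipeline stages,
// feeding a classifier stage at the end.
val indexer = new StringIndexer()
  .setInputCol("species").setOutputCol("label")
val scaler = new StandardScaler()
  .setInputCol("features").setOutputCol("scaledFeatures")

val pipeline = new Pipeline()
  .setStages(Array(indexer, scaler, dl4jClassifier))
val model = pipeline.fit(df)
```
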
We hope this proves useful to the community as we transition to exciting new 
concepts in Spark SQL and Spark ML. Meanwhile, we have Spark working with 
multiple GPUs on AWS, and we're looking forward to optimizations that will 
speed neural net training even more.

Eron Wright
Contributor | deeplearning4j.org
