Looks very interesting, thanks for sharing this.

I haven't had much chance to do more than a quick glance over the code.
Quick question - are the Word2Vec and GloVe implementations fully
parallelized on Spark?

On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright <ewri...@live.com> wrote:

>
> The deeplearning4j framework provides a variety of distributed, neural
> network-based learning algorithms, including convolutional nets, deep
> auto-encoders, deep-belief nets, and recurrent nets.  We’re working on
> integration with the Spark ML pipeline, leveraging the developer API.
> This announcement shares some code and asks for feedback from the
> Spark community.
>
> The integration code is located in the dl4j-spark-ml module
> <https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-scaleout/spark/dl4j-spark-ml>
> in the deeplearning4j repository.
>
> Major aspects of the integration work:
>
>    1. *ML algorithms.*  To bind the dl4j algorithms to the ML pipeline,
>    we developed a new classifier
>    <https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala>
>    and a new unsupervised learning estimator
>    <https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/Unsupervised.scala>.
>    A usage sketch follows this list.
>
>    2. *ML attributes.*  We strove to interoperate well with other
>    pipeline components.  ML attributes are column-level metadata that
>    let pipeline components share information.  Here
>    <https://github.com/deeplearning4j/deeplearning4j/blob/4d33302dd8a792906050eda82a7d50ff77a8d957/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala#L89>
>    you can see how the classifier reads label metadata from a column
>    produced by the new StringIndexer
>    <http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer>.
>    A sketch of reading that metadata follows this list.
>    3. *Large binary data.*  Working with large binary data in Spark is
>    challenging.  An effective approach is to leverage PrunedScan and to
>    carefully control partition sizes.  Here
>    <https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/sources/lfw/LfwRelation.scala>
>    we explored this with a custom data source based on the new relation
>    API; a skeletal example follows this list.
>    4. *Column-based record readers.*  Here
>    <https://github.com/deeplearning4j/deeplearning4j/blob/b237385b56d42d24bd3c99d1eece6cb658f387f2/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/sources/lfw/LfwRelation.scala#L96>
>    we explored how to construct rows from a Hadoop input split by
>    composing a number of column-level readers, with pruning support
>    (see the same skeletal example below).
>    5. *UDTs.*  Spark SQL makes it possible to introduce new data types.
>    We prototyped an experimental Tensor type here
>    <https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/types/tensors.scala>;
>    a simplified UDT sketch follows this list.
>    6. *Spark Package.*  We developed a Spark package to make it easy to
>    use the dl4j framework in spark-shell and with spark-submit.  See the
>    deeplearning4j/dl4j-spark-ml
>    <https://github.com/deeplearning4j/dl4j-spark-ml> repository for
>    useful snippets involving the sbt-spark-package plugin; a typical
>    build.sbt fragment follows this list.
>    7. *Example code.*  Examples demonstrate how the standardized ML API
>    simplifies interoperability, such as with label preprocessing and
>    feature scaling.  See the deeplearning4j/dl4j-spark-ml-examples
>    <https://github.com/deeplearning4j/dl4j-spark-ml-examples> repository
>    for an expanding set of example pipelines; the first sketch below
>    gives a flavor.
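>
> To give a flavor of items 1 and 7, here is a minimal usage sketch. The
> setter calls on the classifier assume it follows the standard ML
> Predictor/Classifier params, and the input DataFrame and column names
> are placeholders; consult the examples repository for working code.
>
>     import org.apache.spark.ml.Pipeline
>     import org.apache.spark.ml.feature.{StandardScaler, StringIndexer}
>     import org.deeplearning4j.spark.ml.classification.MultiLayerNetworkClassification
>
>     // Index string labels into numeric labels (recording ML attributes).
>     val indexer = new StringIndexer()
>       .setInputCol("labelText")
>       .setOutputCol("label")
>
>     // Standardize features before training.
>     val scaler = new StandardScaler()
>       .setInputCol("features")
>       .setOutputCol("scaledFeatures")
>
>     // The dl4j classifier plugs in as an ordinary pipeline stage.
>     val classifier = new MultiLayerNetworkClassification()
>       .setLabelCol("label")
>       .setFeaturesCol("scaledFeatures")
>
>     val pipeline = new Pipeline().setStages(Array(indexer, scaler, classifier))
>     val model = pipeline.fit(trainingData)  // trainingData: a DataFrame
>                                             // with "labelText" and "features"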
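>
> Regarding item 2, here is a minimal sketch of reading label metadata
> back out of a column via Spark's Attribute API, assuming the "label"
> column was produced by StringIndexer:
>
>     import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
>     import org.apache.spark.sql.DataFrame
>
>     // StringIndexer records the label values as a NominalAttribute in
>     // the column metadata; downstream stages can recover the class count
>     // without an extra pass over the data.
>     def numClasses(dataset: DataFrame, labelCol: String = "label"): Int =
>       Attribute.fromStructField(dataset.schema(labelCol)) match {
>         case nominal: NominalAttribute =>
>           nominal.getNumValues.getOrElse(
>             sys.error(s"no value count recorded for column $labelCol"))
>         case _ =>
>           sys.error(s"column $labelCol is not nominal; apply StringIndexer first")
>       }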
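>
> For items 3 and 4, the skeleton below shows the shape of the approach.
> The schema, the class name, and the reader logic are illustrative
> stand-ins, not the actual LfwRelation code:
>
>     import org.apache.spark.rdd.RDD
>     import org.apache.spark.sql.{Row, SQLContext}
>     import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
>     import org.apache.spark.sql.types._
>
>     // A relation over large binary records that materializes only the
>     // columns a query actually asks for.
>     class BinaryRelation(paths: Seq[String])(@transient val sqlContext: SQLContext)
>       extends BaseRelation with PrunedScan {
>
>       override def schema: StructType = StructType(Seq(
>         StructField("path", StringType, nullable = false),
>         StructField("label", StringType, nullable = false),
>         StructField("image", BinaryType, nullable = false)))
>
>       override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
>         // One column-level reader per field; the expensive byte reader
>         // runs only when "image" is among the required columns.
>         val readers: Map[String, String => Any] = Map(
>           "path"  -> (identity[String] _),
>           "label" -> ((p: String) => p.takeWhile(_ != '/')), // placeholder
>           "image" -> ((p: String) => Array.empty[Byte]))     // placeholder
>         sqlContext.sparkContext
>           .parallelize(paths, numSlices = paths.size max 1)  // control partition sizes
>           .map(p => Row.fromSeq(requiredColumns.toSeq.map(c => readers(c)(p))))
>       }
>     }
>
> Keeping each "file" in its own partition is one simple way to bound the
> amount of binary data any single task must hold.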
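>
> And for item 5, the general shape of a UDT under the (experimental)
> Spark 1.x developer API, using a simplified stand-in rather than the
> actual Tensor type:
>
>     import org.apache.spark.sql.types._
>
>     // Simplified value class standing in for the Tensor type.
>     @SQLUserDefinedType(udt = classOf[DoubleArrayUDT])
>     class DoubleArray(val values: Array[Double]) extends Serializable
>
>     // Tells Catalyst how to map the user type to a SQL-native
>     // representation and back.
>     class DoubleArrayUDT extends UserDefinedType[DoubleArray] {
>       override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
>       override def serialize(obj: Any): Any = obj match {
>         case a: DoubleArray => a.values.toSeq
>       }
>       override def deserialize(datum: Any): DoubleArray = datum match {
>         case s: Seq[_] => new DoubleArray(s.map(_.asInstanceOf[Double]).toArray)
>       }
>       override def userClass: Class[DoubleArray] = classOf[DoubleArray]
>     }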
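>
> Finally, for item 6, a typical build.sbt fragment when using the
> sbt-spark-package plugin might look as follows (the version numbers are
> illustrative):
>
>     // build.sbt -- requires sbt-spark-package in project/plugins.sbt
>     spName := "deeplearning4j/dl4j-spark-ml"  // package name on spark-packages.org
>     sparkVersion := "1.4.0"                   // Spark version to build against
>     sparkComponents ++= Seq("mllib", "sql")   // adds spark-mllib/spark-sql as provided deps
>
> The published package can then be pulled into a shell session with
> spark-shell's --packages flag.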
>
> Hope this proves useful to the community as we transition to exciting
> new concepts in Spark SQL and Spark ML.  Meanwhile, we have Spark
> working with multiple GPUs on AWS
> <http://deeplearning4j.org/gpu_aws.html>, and we're looking forward to
> optimizations that will speed up neural net training even more.
>
> Eron Wright
> Contributor | deeplearning4j.org
>
>
