Re: does "Deep Learning Pipelines" scale out linearly?

2017-11-28 Thread Tim Hunter
Hello Andy,
Regarding your question, this will depend a lot on the specific task:
 - Tasks that are "easy" to distribute, such as inference (scoring),
hyper-parameter tuning or cross-validation, will take full advantage
of the cluster, and performance should improve more or less linearly
as you add machines.
 - For training a single model across multiple machines on a
distributed dataset, you are currently better off with a dedicated
solution such as TensorFlowOnSpark or dist-keras. We are working on
addressing this in a future release.
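
To illustrate why scoring is "easy" to distribute, here is a minimal
sketch in plain Python. It does not use Spark or Deep Learning
Pipelines: the score function, the split helper and the thread pool
are stand-ins chosen for illustration. The point is only that each
partition is scored without looking at any other partition, so more
workers can take more partitions.

```python
from concurrent.futures import ThreadPoolExecutor


def score(x):
    # Hypothetical stand-in for a trained model's prediction on one input.
    return 2.0 * x + 1.0


def score_partition(partition):
    # A partition is scored on its own; no other partition is needed.
    return [score(x) for x in partition]


def split(data, n):
    # Split data into n contiguous partitions of near-equal size.
    size, rem = divmod(len(data), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        parts.append(data[start:end])
        start = end
    return parts


def parallel_score(data, num_workers):
    # Score all partitions concurrently; map() keeps the partition order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(score_partition, split(data, num_workers)))
    return [y for part in results for y in part]


data = [float(i) for i in range(10)]
predictions = parallel_score(data, num_workers=4)
```

The result does not depend on the number of workers, which is what
makes this kind of task scale out. Training one model is different
because the workers have to exchange parameter updates.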

Also, we opened a mailing list dedicated to Deep Learning Pipelines,
to which I will copy this answer. Feel free to answer there:

https://groups.google.com/forum/#!forum/dl-pipelines-users/


Tim


On November 22, 2017 at 10:02:59 AM, Andy Davidson
(a...@santacruzintegration.com) wrote:
> I am starting a new deep learning project. Currently we do all of our work
> on a single machine using a combination of Keras and TensorFlow.
> https://databricks.github.io/spark-deep-learning/site/index.html looks very
> promising. Any idea how performance is likely to improve as I add machines
> to my cluster?
>
> Kind regards
>
> Andy
>
>
> P.s. Is user@spark.apache.org the best place to ask questions about this
> package?

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[ann] Release of TensorFrames 0.2.8

2017-04-25 Thread Tim Hunter
Hello all,

I would like to bring to your attention the (long overdue) release of a new
version of TensorFrames. Thank you to everyone who reported
packaging and installation issues. This release fixes a large number of
performance and stability problems, and brings a few improvements.

As an example, following this notebook [1], you can distribute the
classification of images using Spark, TensorFlow and the Inception V3 model
from Google. It is published as a Databricks notebook and it has been
tested on Jupyter as well.

What is TensorFrames?
TensorFrames (TensorFlow on Spark Dataframes) lets you manipulate Spark's
DataFrames with TensorFlow programs.

Spark package:
https://spark-packages.org/package/databricks/tensorframes

Release notes:
https://github.com/databricks/tensorframes/releases/tag/v0.2.8

Best regards

Tim Hunter


[1]
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5181772898130619/2927463166045304/1282150081618649/latest.html


GraphFrames 0.2.0 released

2016-08-16 Thread Tim Hunter
Hello all,
I have released version 0.2.0 of the GraphFrames package. Apart from a few
bug fixes, it is the first release published for Spark 2.0 and both Scala
2.10 and 2.11. Please let us know if you have any comments or questions.

It is available as a Spark package:
https://spark-packages.org/package/graphframes/graphframes

The source code is available as always at
https://github.com/graphframes/graphframes


What is GraphFrames?

GraphFrames is a DataFrame-based graph engine for Spark. In addition to the
algorithms available in GraphX, users can write highly expressive queries
by leveraging the DataFrame API, combined with a new API for motif finding.
The user also benefits from DataFrame performance optimizations within the
Spark SQL engine.
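
As a rough idea of what a motif query computes, the sketch below
finds pairs of vertices connected by edges in both directions. It is
plain Python over a list of (src, dst) pairs, not the GraphFrames
API; the function name and the sample edges are made up for
illustration, and the user guide describes the actual query syntax.

```python
# Edges as (src, dst) pairs; in a DataFrame-based engine these would be
# rows of an edge table with "src" and "dst" columns.
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "c")]


def find_reciprocal_pairs(edges):
    # Motif: an edge x -> y together with an edge y -> x.
    # This is equivalent to joining the edge table with itself on
    # (e1.dst == e2.src) and (e1.src == e2.dst).
    edge_set = set(edges)
    # Keep only s < d so that each pair is reported once.
    return sorted((s, d) for (s, d) in edge_set
                  if s < d and (d, s) in edge_set)


pairs = find_reciprocal_pairs(edges)
```

Expressing the pattern as a join is what lets a DataFrame-based engine
hand it to the query optimizer.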

Cheers

Tim


Request for comments: TensorFrames, an integration library between TensorFlow and Spark DataFrames

2016-03-18 Thread Tim Hunter
Hello all,

I would like to bring your attention to a small project to integrate
TensorFlow with Apache Spark, called TensorFrames. With this library, you
can map, reduce or aggregate numerical data stored in Spark dataframes
using TensorFlow computation graphs. It is published as a Spark package and
available in this GitHub repository:

https://github.com/tjhunter/tensorframes

More detailed examples can be found in the user guide:

https://github.com/tjhunter/tensorframes/wiki/TensorFrames-user-guide
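
To give a feel for the map and reduce operations mentioned above,
here is a sketch in plain Python. It does not use TensorFrames or
TensorFlow: the blocks list stands in for a numerical column split
across partitions, and add_three and block_sum are hypothetical
stand-ins for small computation graphs.

```python
from functools import reduce

# A column of doubles split into blocks, the way a DataFrame column is
# split across partitions.
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]


def add_three(block):
    # "Map": the same element-wise computation applied to a whole block.
    return [x + 3.0 for x in block]


def block_sum(block):
    # "Reduce" inside one block: many values in, one value out.
    return sum(block)


# Map keeps one output value per input value.
mapped = [add_three(b) for b in blocks]

# Reduce first collapses each block, then combines the partial results.
partials = [block_sum(b) for b in blocks]
total = reduce(lambda a, b: a + b, partials)
```

Working on blocks rather than single rows is what allows the numerical
work to be handed to a vectorized engine one batch at a time.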

This is a technical preview at this point. I am looking forward to some
feedback about the current Python API if some adventurous users want to try
it out. Of course, contributions are most welcome, for example to fix bugs
or to add support for platforms other than linux-x86_64. The library should
support the most common inputs in dataframes (dense tensors of rank 0, 1 or
2 of ints, longs, floats and doubles).

Please note that this is not an endorsement by Databricks of TensorFlow, or
any other deep learning framework for that matter. If users want to use
deep learning in production, some other more robust solutions are
available: SparkNet, CaffeOnSpark, DeepLearning4J.

Best regards


Tim Hunter


Introducing spark-sklearn, a scikit-learn integration package for Spark

2016-02-10 Thread Tim Hunter
Hello community,
I would like to introduce a new Spark package that should
be useful for Python users who depend on scikit-learn.

Among other things, it lets you:
 - train and evaluate multiple scikit-learn models in parallel
 - convert Spark's DataFrames seamlessly into NumPy arrays
 - (experimental) distribute SciPy's sparse matrices as a dataset of
sparse vectors

Spark-sklearn focuses on problems that have a small amount of data and
that can be run in parallel. Note that this package distributes simple
tasks like grid-search cross-validation. It does not distribute
individual learning algorithms (unlike Spark MLlib).
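
The kind of task meant here can be sketched in plain Python. The
example below does not use spark-sklearn or scikit-learn: the toy
data, the fit_and_score function and the thread pool are assumptions
made for illustration, and cross-validation folds are left out for
brevity. Each parameter combination is fitted and scored
independently, and only the small result is collected.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Toy data following y = 3x. The "model" is y = w * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]


def fit_and_score(params):
    # Fit w by gradient descent with the given learning rate and number
    # of steps, then return the mean squared error on the same data.
    lr, steps = params
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return params, mse


# Every (learning rate, steps) pair is one independent task.
grid = list(product([0.001, 0.01, 0.05], [10, 100]))

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fit_and_score, grid))

best_params, best_mse = min(results, key=lambda r: r[1])
```

Each task still trains on a single worker, so this only helps when one
model and its data fit comfortably on one machine.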

If you want to use it, see instructions on the package page:
https://github.com/databricks/spark-sklearn

This blog post contains more details:
https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html

Let us know if you have any questions. Also, documentation or code
contributions are very welcome (Apache 2.0 license).

Cheers

Tim and Joseph
