Re: does "Deep Learning Pipelines" scale out linearly?
Hello Andy,

Regarding your question, this depends a lot on the specific task:

- Tasks that are "easy" to distribute, such as inference (scoring), hyper-parameter tuning or cross-validation, take full advantage of the cluster, and performance should improve more or less linearly.
- For training a single model on multiple machines with a distributed dataset, you are currently better off with a dedicated solution such as TensorFlowOnSpark or dist-keras. We are working on addressing this issue in a future release.

Also, we opened a mailing list dedicated to Deep Learning Pipelines, to which I will copy this answer. Feel free to answer there:
https://groups.google.com/forum/#!forum/dl-pipelines-users/

Tim

On November 22, 2017 at 10:02:59 AM, Andy Davidson (a...@santacruzintegration.com) wrote:

> I am starting a new deep learning project. Currently we do all of our work on
> a single machine using a combination of Keras and TensorFlow.
> https://databricks.github.io/spark-deep-learning/site/index.html looks very
> promising. Any idea how performance is likely to improve as I add machines
> to my cluster?
>
> Kind regards
>
> Andy
>
> P.s. Is user@spark.apache.org the best place to ask questions about this
> package?
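To illustrate why scoring counts as "easy" to distribute, here is a minimal sketch in plain Python. It is not the Deep Learning Pipelines API: `score_partition`, `split` and `distributed_score` are hypothetical names, the "model" is a made-up linear function, and threads stand in for Spark executors. The point is only that each partition is scored without talking to any other partition, which is why adding workers should help roughly in proportion.

```python
from concurrent.futures import ThreadPoolExecutor


def score_partition(rows):
    """Stand-in for model inference on one partition (hypothetical model: y = 2x + 1)."""
    return [2.0 * x + 1.0 for x in rows]


def split(data, num_partitions):
    """Split data into contiguous partitions of roughly equal size."""
    if not data:
        return []
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def distributed_score(data, num_workers):
    """Score every partition independently, then concatenate the results in order."""
    partitions = split(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves partition order; no partition depends on another.
        results = list(pool.map(score_partition, partitions))
    return [y for part in results for y in part]


if __name__ == "__main__":
    print(distributed_score([float(i) for i in range(10)], num_workers=4))
```

Training a single model does not decompose this way, because the workers have to exchange parameter updates, which matches the distinction drawn in the reply above.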
[ann] Release of TensorFrames 0.2.8
Hello all,

I would like to bring to your attention the (long overdue) release of a new version of TensorFrames. Thank you to everyone who reported packaging and installation issues. This release fixes a large number of performance and stability problems, and brings a few improvements.

As an example, following this notebook [1], you can distribute the classification of images using Spark, TensorFlow and the Inception V3 model from Google. It is published as a Databricks notebook and it has been tested on Jupyter as well.

What is TensorFrames?
TensorFrames (TensorFlow on Spark DataFrames) lets you manipulate Spark's DataFrames with TensorFlow programs.

Spark package: https://spark-packages.org/package/databricks/tensorframes
Release notes: https://github.com/databricks/tensorframes/releases/tag/v0.2.8

Best regards

Tim Hunter

[1] https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5181772898130619/2927463166045304/1282150081618649/latest.html
GraphFrames 0.2.0 released
Hello all,

I have released version 0.2.0 of the GraphFrames package. Apart from a few bug fixes, it is the first release published for Spark 2.0 and for both Scala 2.10 and 2.11. Please let us know if you have any comments or questions.

It is available as a Spark package:
https://spark-packages.org/package/graphframes/graphframes

The source code is available as always at https://github.com/graphframes/graphframes

What is GraphFrames?
GraphFrames is a DataFrame-based graph engine for Spark. In addition to the algorithms available in GraphX, users can write highly expressive queries by leveraging the DataFrame API, combined with a new API for motif finding. The user also benefits from DataFrame performance optimizations within the Spark SQL engine.

Cheers

Tim
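For readers unfamiliar with motif finding, the following plain-Python sketch shows the kind of structural query it expresses: find every pair of vertices connected in both directions. It does not use GraphFrames; `find_mutual_edges` is a hypothetical helper written for this illustration. In GraphFrames itself such a pattern is written as a motif string passed to `find`, along the lines of `g.find("(a)-[e]->(b); (b)-[e2]->(a)")`, and the exact result format there may differ from what this sketch returns.

```python
def find_mutual_edges(edges):
    """Return every ordered pair (a, b) such that both a->b and b->a exist."""
    edge_set = set(edges)
    return sorted(
        (a, b) for (a, b) in edge_set
        if a != b and (b, a) in edge_set  # skip self-loops, require the reverse edge
    )


if __name__ == "__main__":
    edges = [
        ("alice", "bob"),
        ("bob", "alice"),
        ("bob", "carol"),
        ("carol", "dave"),
    ]
    print(find_mutual_edges(edges))
```

A set lookup keeps the sketch linear in the number of edges. A DataFrame-based engine would express the same pattern as joins over the edge table, which is where the Spark SQL optimizations mentioned above come in.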
Request for comments: TensorFrames, an integration library between TensorFlow and Spark DataFrames
Hello all,

I would like to bring your attention to a small project to integrate TensorFlow with Apache Spark, called TensorFrames. With this library, you can map, reduce or aggregate numerical data stored in Spark DataFrames using TensorFlow computation graphs. It is published as a Spark package and available in this GitHub repository:
https://github.com/tjhunter/tensorframes

More detailed examples can be found in the user guide:
https://github.com/tjhunter/tensorframes/wiki/TensorFrames-user-guide

This is a technical preview at this point. I am looking forward to feedback about the current Python API if some adventurous users want to try it out. Of course, contributions are most welcome, for example to fix bugs or to add support for platforms other than linux-x86_64. It should support the most common inputs in DataFrames (dense tensors of rank 0, 1 or 2 of ints, longs, floats and doubles).

Please note that this is not an endorsement by Databricks of TensorFlow, or of any other deep learning framework for that matter. If users want to use deep learning in production, other more robust solutions are available: SparkNet, CaffeOnSpark, DeepLearning4J.

Best regards

Tim Hunter
Introducing spark-sklearn, a scikit-learn integration package for Spark
Hello community,

I would like to introduce a new Spark package that should be useful for Python users who depend on scikit-learn. Among other tools, it lets you:

- train and evaluate multiple scikit-learn models in parallel
- convert Spark's DataFrames seamlessly into NumPy arrays
- (experimental) distribute SciPy's sparse matrices as a dataset of sparse vectors

Spark-sklearn focuses on problems that have a small amount of data and that can be run in parallel. Note that this package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).

If you want to use it, see the instructions on the package page:
https://github.com/databricks/spark-sklearn

This blog post contains more details:
https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html

Let us know if you have any questions. Documentation or code contributions are also very welcome (Apache 2.0 license).

Cheers

Tim and Joseph
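The distinction between distributing tasks and distributing a learning algorithm can be sketched in plain Python. The code below is not the spark-sklearn API and does not use scikit-learn: the one-dimensional ridge model, the function names and the fold assignment are all assumptions made for illustration, and threads stand in for Spark executors. Each (hyper-parameter, fold) pair is a small independent job that fits on one worker, which is the shape of problem described above.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form 1-D ridge regression without intercept (toy model)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def evaluate(task):
    """Train on k-1 folds and score on the held-out fold; one independent job."""
    alpha, fold, xs, ys, k = task
    pairs = list(zip(xs, ys))
    train = [p for i, p in enumerate(pairs) if i % k != fold]
    held_out = [p for i, p in enumerate(pairs) if i % k == fold]
    w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], alpha)
    return mse(w, [x for x, _ in held_out], [y for _, y in held_out])


def grid_search_cv(xs, ys, alphas, k=3, max_workers=4):
    """Run every (alpha, fold) job in parallel and return the best alpha."""
    tasks = [(alpha, fold, xs, ys, k) for alpha, fold in product(alphas, range(k))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        errors = list(pool.map(evaluate, tasks))
    mean_error = {alpha: 0.0 for alpha in alphas}
    for task, err in zip(tasks, errors):
        mean_error[task[0]] += err / k  # average the k fold errors per alpha
    best = min(mean_error, key=mean_error.get)
    return best, mean_error


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    ys = [2.0 * x for x in xs]
    print(grid_search_cv(xs, ys, alphas=[0.0, 1.0, 10.0]))
```

Note that every job receives the full dataset, which is only reasonable when the data is small. That is the trade-off the announcement points out: the search is parallel, the individual model fit is not.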