Re: Supporting Apache Aurora as a cluster manager

2017-09-23 Thread karthik padmanabhan
Hi Mark,

Thanks for getting back. I think you raise a very valid point about moving to a
plug-in-based architecture instead of supporting the idiosyncrasies of different
schedulers. Yeah, let me write a design doc so that it will at least be another
data point for how we think about the plug-in architecture discussed in
SPARK-19700.

Thanks
Karthik

On Sun, Sep 10, 2017 at 11:02 PM, Mark Hamstra 
wrote:

> While it may be worth creating the design doc and JIRA ticket so that we
> at least have a better idea and a record of what you are talking about, I
> kind of doubt that we are going to want to merge this into the Spark
> codebase. That's not because of anything specific to this Aurora effort,
> but rather because adding more in-tree scheduler implementations is not the
> preferred direction in general. There is already some regret that the YARN
> scheduler wasn't implemented by means of a scheduler plug-in API, and there
> is likely to be more regret if we continue to go forward with the
> spark-on-kubernetes SPIP in its present form. I'd guess that we are likely
> to merge code associated with that SPIP just because Kubernetes has become
> such an important resource scheduler, but such a merge wouldn't be without
> some misgivings. That is because we just can't get into the position of
> having more and more scheduler implementations in the Spark code, and more
> and more maintenance overhead to keep up with the idiosyncrasies of all the
> scheduler implementations. We've really got to get to the kind of plug-in
> architecture discussed in SPARK-19700 so that scheduler implementations can
> be done outside of the Spark codebase, release schedule, etc.
>
> My opinion on the subject isn't dispositive on its own, of course, but
> that is how I'm seeing things right now.
>
> On Sun, Sep 10, 2017 at 8:27 PM, karthik padmanabhan <
> treadston...@gmail.com> wrote:
>
>> Hi Spark Devs,
>>
>> We are using Aurora (http://aurora.apache.org/) as our Mesos framework
>> for running stateless services. We would like to use Aurora to deploy big
>> data and batch workloads as well. To do this, we have forked Spark and
>> implemented the ExternalClusterManager trait.
>>
>> The reason for doing this rather than running Spark on Mesos directly is to
>> leverage the existing roles and quotas provided by Aurora for admission
>> control, and also to leverage Aurora features such as priority and preemption.
>> Additionally, we would like Aurora to be the only deployment/orchestration
>> system that our users have to interact with.
>>
>> We have a working POC where Spark launches jobs through Aurora as the
>> cluster manager. Is this something that could be merged upstream? If so, I can
>> create a design document and an associated JIRA ticket.
>>
>> Thanks
>> Karthik
>>
>
>
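For context, ExternalClusterManager is the Spark trait such an integration plugs into. Below is a minimal, illustrative Scala sketch of an implementation: the AuroraClusterManager and AuroraSchedulerBackend names, the aurora:// master URL prefix, and all method bodies are assumptions for illustration only, not the actual POC code. The trait itself is private[spark] and implementations are discovered via Java's ServiceLoader, so a real plug-in must also ship a META-INF/services/org.apache.spark.scheduler.ExternalClusterManager file naming the class.

// Illustrative sketch only -- not the actual Aurora POC described above.
package org.apache.spark.scheduler.cluster.aurora

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend, TaskScheduler, TaskSchedulerImpl}
import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend

class AuroraClusterManager extends ExternalClusterManager {

  // Claim master URLs of the (hypothetical) form aurora://host:port.
  override def canCreate(masterURL: String): Boolean =
    masterURL.startsWith("aurora://")

  // Reuse Spark's standard task-level scheduler.
  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  // The backend is where Aurora-specific executor provisioning would live.
  override def createSchedulerBackend(
      sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend =
    new AuroraSchedulerBackend(scheduler.asInstanceOf[TaskSchedulerImpl], sc, masterURL)

  // Wire the scheduler and backend together before the SparkContext uses them.
  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}

// Stub backend: a real integration would override start()/stop() and related hooks
// to submit and tear down executor jobs through the Aurora API.
private class AuroraSchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext,
    masterURL: String)
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.rpcEnv)

With such a class registered, users would point spark-submit at a master URL like aurora://... (again, a hypothetical scheme) and Spark would route cluster scheduling through the plug-in rather than through an in-tree implementation.
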


Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-23 Thread vaquar khan
+1 looks good,

Regards,
Vaquar khan

On Sat, Sep 23, 2017 at 12:22 PM, Matei Zaharia 
wrote:

> +1; we should consider something similar for multi-dimensional tensors too.
>
> Matei
>
> > On Sep 23, 2017, at 7:27 AM, Yanbo Liang  wrote:
> >
> > +1
> >
> > On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan  wrote:
> > +1
> >
> > Regards
> > Noman
> > From: Denny Lee 
> > Sent: Friday, September 22, 2017 2:59:33 AM
> > To: Apache Spark Dev; Sean Owen; Tim Hunter
> > Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
> > Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
> >
> > +1
> >
> > On Thu, Sep 21, 2017 at 11:15 Sean Owen  wrote:
> > Am I right that this doesn't mean other packages would use this
> representation, but that they could?
> >
> > The representation looked fine to me w.r.t. what DL frameworks need.
> >
> > My previous comment was that this is actually quite lightweight. It's
> kind of like how I/O support is provided for CSV and JSON, so makes enough
> sense to add to Spark. It doesn't really preclude other solutions.
> >
> > For those reasons I think it's fine. +1
> >
> > On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter 
> wrote:
> > Hello community,
> >
> > I would like to call for a vote on SPARK-21866. It is a short proposal
> that has important applications for image processing and deep learning.
> Joseph Bradley has offered to be the shepherd.
> >
> > JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
> > PDF version: https://issues.apache.org/jira/secure/attachment/
> 12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
> >
> > Background and motivation
> > As Apache Spark is being used more and more in the industry, some new
> use cases are emerging for different data formats beyond the traditional
> SQL types or the numerical types (vectors and matrices). Deep Learning
> applications commonly deal with image processing. A number of projects add
> some Deep Learning capabilities to Spark (see list below), but they
> struggle to communicate with each other or with MLlib pipelines because
> there is no standard way to represent an image in Spark DataFrames. We
> propose to federate efforts for representing images in Spark by defining a
> representation that caters to the most common needs of users and library
> developers.
> > This SPIP proposes a specification to represent images in Spark
> DataFrames and Datasets (based on existing industrial standards), and an
> interface for loading sources of images. It is not meant to be a
> full-fledged image processing library, but rather the core description that
> other libraries and users can rely on. Several packages already offer
> various processing facilities for transforming images or doing more complex
> operations, and each has various design tradeoffs that make them better as
> standalone solutions.
> > This project is a joint collaboration between Microsoft and Databricks,
> which have been testing this design in two open source packages: MMLSpark
> and Deep Learning Pipelines.
> > The proposed image format is an in-memory, decompressed representation
> that targets low-level applications. It is significantly more liberal in
> memory usage than compressed image representations such as JPEG, PNG, etc.,
> but it allows easy communication with popular image processing libraries
> and has no decoding overhead.
> > Target users and personas:
> > Data scientists, data engineers, library developers.
> > The following libraries define primitives for loading and representing
> images, and will gain from a common interchange format (in alphabetical
> order):
> >   • BigDL
> >   • DeepLearning4J
> >   • Deep Learning Pipelines
> >   • MMLSpark
> >   • TensorFlow (Spark connector)
> >   • TensorFlowOnSpark
> >   • TensorFrames
> >   • Thunder
> > Goals:
> >   • Simple representation of images in Spark DataFrames, based on
> pre-existing industrial standards (OpenCV)
> >   • This format should eventually allow the development of
> high-performance integration points with image processing libraries such as
> libOpenCV, Google TensorFlow, CNTK, and other C libraries.
> >   • The reader should be able to read popular formats of images from
> distributed sources.
> > Non-Goals:
> > Images are a versatile medium and encompass a very wide range of formats
> and representations. This SPIP explicitly aims at the most common use case
> in the industry currently: multi-channel matrices of binary, int32, int64,
> float or double data that can fit comfortably in the heap of the JVM:
> >   • the total size of an image should be restricted to less than 2GB
> (roughly)
> >   • the meaning of color channels is application-specific and is not
> mandated by the standard (in line with the OpenCV standard)
> >   • 

Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-23 Thread Matei Zaharia
+1; we should consider something similar for multi-dimensional tensors too.

Matei

> On Sep 23, 2017, at 7:27 AM, Yanbo Liang  wrote:
> 
> +1
> 
> On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan  wrote:
> +1 
> 
> Regards 
> Noman 
> From: Denny Lee 
> Sent: Friday, September 22, 2017 2:59:33 AM
> To: Apache Spark Dev; Sean Owen; Tim Hunter
> Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
> Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
>  
> +1 
> 
> On Thu, Sep 21, 2017 at 11:15 Sean Owen  wrote:
> Am I right that this doesn't mean other packages would use this 
> representation, but that they could?
> 
> The representation looked fine to me w.r.t. what DL frameworks need.
> 
> My previous comment was that this is actually quite lightweight. It's kind of 
> like how I/O support is provided for CSV and JSON, so makes enough sense to 
> add to Spark. It doesn't really preclude other solutions.
> 
> For those reasons I think it's fine. +1
> 
> On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter  wrote:
> Hello community,
> 
> I would like to call for a vote on SPARK-21866. It is a short proposal that 
> has important applications for image processing and deep learning. Joseph 
> Bradley has offered to be the shepherd.
> 
> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
> PDF version: 
> https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
> 
> Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
>   • BigDL
>   • DeepLearning4J
>   • Deep Learning Pipelines
>   • MMLSpark
>   • TensorFlow (Spark connector)
>   • TensorFlowOnSpark
>   • TensorFrames
>   • Thunder
> Goals:
>   • Simple representation of images in Spark DataFrames, based on 
> pre-existing industrial standards (OpenCV)
>   • This format should eventually allow the development of 
> high-performance integration points with image processing libraries such as 
> libOpenCV, Google TensorFlow, CNTK, and other C libraries.
>   • The reader should be able to read popular formats of images from 
> distributed sources.
> Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
>   • the total size of an image should be restricted to less than 2GB 
> (roughly)
>   • the meaning of color channels is application-specific and is not 
> mandated by the standard (in line with the OpenCV standard)
>   • specialized formats used in meteorology, the medical field, etc. are 
> not supported
>   • this format is specialized to images and does not attempt to solve 
> the more general problem of 

Tagging issues for 2.1.2 / 2.1.3

2017-09-23 Thread Holden Karau
Just a friendly reminder, since I've cut the next RC tag, that we're in the
middle of the release cycle: if you're merging issues into branch-2.1, please
tag them with fix version 2.1.3, and I'll retag any 2.1.3 issues to 2.1.2
when/if I have to cut the next RC tag. I'll take care of this for issues merged
into branch-2.1 so far, but for the rest of the release it would help if y'all
could do it :)

-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-23 Thread Yanbo Liang
+1

On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan  wrote:

> +1
>
> Regards
> Noman
> --
> *From:* Denny Lee 
> *Sent:* Friday, September 22, 2017 2:59:33 AM
> *To:* Apache Spark Dev; Sean Owen; Tim Hunter
> *Cc:* Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
> *Subject:* Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
>
> +1
>
> On Thu, Sep 21, 2017 at 11:15 Sean Owen  wrote:
>
>> Am I right that this doesn't mean other packages would use this
>> representation, but that they could?
>>
>> The representation looked fine to me w.r.t. what DL frameworks need.
>>
>> My previous comment was that this is actually quite lightweight. It's
>> kind of like how I/O support is provided for CSV and JSON, so makes enough
>> sense to add to Spark. It doesn't really preclude other solutions.
>>
>> For those reasons I think it's fine. +1
>>
>> On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter 
>> wrote:
>>
>>> Hello community,
>>>
>>> I would like to call for a vote on SPARK-21866. It is a short proposal
>>> that has important applications for image processing and deep learning.
>>> Joseph Bradley has offered to be the shepherd.
>>>
>>> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
>>> PDF version: https://issues.apache.org/jira/secure/attachment/12884792/
>>> SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
>>>
>>> Background and motivation
>>>
>>> As Apache Spark is being used more and more in the industry, some new
>>> use cases are emerging for different data formats beyond the traditional
>>> SQL types or the numerical types (vectors and matrices). Deep Learning
>>> applications commonly deal with image processing. A number of projects add
>>> some Deep Learning capabilities to Spark (see list below), but they
>>> struggle to communicate with each other or with MLlib pipelines because
>>> there is no standard way to represent an image in Spark DataFrames. We
>>> propose to federate efforts for representing images in Spark by defining a
>>> representation that caters to the most common needs of users and library
>>> developers.
>>>
>>> This SPIP proposes a specification to represent images in Spark
>>> DataFrames and Datasets (based on existing industrial standards), and an
>>> interface for loading sources of images. It is not meant to be a
>>> full-fledged image processing library, but rather the core description that
>>> other libraries and users can rely on. Several packages already offer
>>> various processing facilities for transforming images or doing more complex
>>> operations, and each has various design tradeoffs that make them better as
>>> standalone solutions.
>>>
>>> This project is a joint collaboration between Microsoft and Databricks,
>>> which have been testing this design in two open source packages: MMLSpark
>>> and Deep Learning Pipelines.
>>>
>>> The proposed image format is an in-memory, decompressed representation
>>> that targets low-level applications. It is significantly more liberal in
>>> memory usage than compressed image representations such as JPEG, PNG, etc.,
>>> but it allows easy communication with popular image processing libraries
>>> and has no decoding overhead.
>>> Target users and personas:
>>>
>>> Data scientists, data engineers, library developers.
>>> The following libraries define primitives for loading and representing
>>> images, and will gain from a common interchange format (in alphabetical
>>> order):
>>>
>>>- BigDL
>>>- DeepLearning4J
>>>- Deep Learning Pipelines
>>>- MMLSpark
>>>- TensorFlow (Spark connector)
>>>- TensorFlowOnSpark
>>>- TensorFrames
>>>- Thunder
>>>
>>> Goals:
>>>
>>>- Simple representation of images in Spark DataFrames, based on
>>>pre-existing industrial standards (OpenCV)
>>>- This format should eventually allow the development of
>>>high-performance integration points with image processing libraries such 
>>> as
>>>libOpenCV, Google TensorFlow, CNTK, and other C libraries.
>>>- The reader should be able to read popular formats of images from
>>>distributed sources.
>>>
>>> Non-Goals:
>>>
>>> Images are a versatile medium and encompass a very wide range of formats
>>> and representations. This SPIP explicitly aims at the most common use
>>> case in the industry currently: multi-channel matrices of binary, int32,
>>> int64, float or double data that can fit comfortably in the heap of the JVM:
>>>
>>>- the total size of an image should be restricted to less than 2GB
>>>(roughly)
>>>- the meaning of color channels is application-specific and is not
>>>mandated by the standard (in line with the OpenCV standard)
>>>- specialized formats used in meteorology, the medical field, etc.
>>>are not supported
>>>- this format is specialized to images and does not attempt to solve
>>>the more general 

Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-23 Thread Noman Khan
+1

Regards
Noman

From: Denny Lee 
Sent: Friday, September 22, 2017 2:59:33 AM
To: Apache Spark Dev; Sean Owen; Tim Hunter
Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

+1

On Thu, Sep 21, 2017 at 11:15 Sean Owen 
> wrote:
Am I right that this doesn't mean other packages would use this representation, 
but that they could?

The representation looked fine to me w.r.t. what DL frameworks need.

My previous comment was that this is actually quite lightweight. It's kind of 
like how I/O support is provided for CSV and JSON, so makes enough sense to add 
to Spark. It doesn't really preclude other solutions.

For those reasons I think it's fine. +1

On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter 
> wrote:
Hello community,

I would like to call for a vote on SPARK-21866. It is a short proposal that has 
important applications for image processing and deep learning. Joseph Bradley 
has offered to be the shepherd.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
PDF version: 
https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf

Background and motivation

As Apache Spark is being used more and more in the industry, some new use cases 
are emerging for different data formats beyond the traditional SQL types or the 
numerical types (vectors and matrices). Deep Learning applications commonly 
deal with image processing. A number of projects add some Deep Learning 
capabilities to Spark (see list below), but they struggle to communicate with 
each other or with MLlib pipelines because there is no standard way to 
represent an image in Spark DataFrames. We propose to federate efforts for 
representing images in Spark by defining a representation that caters to the 
most common needs of users and library developers.

This SPIP proposes a specification to represent images in Spark DataFrames and 
Datasets (based on existing industrial standards), and an interface for loading 
sources of images. It is not meant to be a full-fledged image processing 
library, but rather the core description that other libraries and users can 
rely on. Several packages already offer various processing facilities for 
transforming images or doing more complex operations, and each has various 
design tradeoffs that make them better as standalone solutions.

This project is a joint collaboration between Microsoft and Databricks, which 
have been testing this design in two open source packages: MMLSpark and Deep 
Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that 
targets low-level applications. It is significantly more liberal in memory 
usage than compressed image representations such as JPEG, PNG, etc., but it 
allows easy communication with popular image processing libraries and has no 
decoding overhead.

Target users and personas:

Data scientists, data engineers, library developers.
The following libraries define primitives for loading and representing images, 
and will gain from a common interchange format (in alphabetical order):

  *   BigDL
  *   DeepLearning4J
  *   Deep Learning Pipelines
  *   MMLSpark
  *   TensorFlow (Spark connector)
  *   TensorFlowOnSpark
  *   TensorFrames
  *   Thunder

Goals:

  *   Simple representation of images in Spark DataFrames, based on 
pre-existing industrial standards (OpenCV)
  *   This format should eventually allow the development of high-performance 
integration points with image processing libraries such as libOpenCV, Google 
TensorFlow, CNTK, and other C libraries.
  *   The reader should be able to read popular formats of images from 
distributed sources.

Non-Goals:

Images are a versatile medium and encompass a very wide range of formats and 
representations. This SPIP explicitly aims at the most common use case in the 
industry currently: multi-channel matrices of binary, int32, int64, float or 
double data that can fit comfortably in the heap of the JVM:

  *   the total size of an image should be restricted to less than 2GB (roughly)
  *   the meaning of color channels is application-specific and is not mandated 
by the standard (in line with the OpenCV standard)
  *   specialized formats used in meteorology, the medical field, etc. are not 
supported
  *   this format is specialized to images and does not attempt to solve the 
more general problem of representing n-dimensional tensors in Spark

Proposed API changes

We propose to add a new package in the package structure, under the MLlib 
project:
org.apache.spark.image

Data format

We propose to add the following structure:

imageSchema = StructType([

  *   StructField("mode", StringType(), False),
        *   The exact representation of the data.
        *   The
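
The schema listing above is cut off in the archive. As a rough illustration of a record layout in the spirit of this draft, here is a hedged Scala sketch: only the "mode" field (with the draft's StringType) comes from the excerpt above, while the remaining fields and their types are assumptions about what an OpenCV-style image row might carry, not the SPIP's final specification.

import org.apache.spark.sql.types._

// Hedged sketch of an image row schema in the spirit of the draft above.
// Only "mode" is taken from the quoted excerpt; the other fields are assumptions.
object ImageSchemaSketch {
  val imageSchema: StructType = StructType(Seq(
    StructField("mode", StringType, nullable = false),       // exact representation of the data (from the draft)
    StructField("origin", StringType, nullable = true),      // assumed: URI the image was loaded from
    StructField("height", IntegerType, nullable = false),    // assumed: number of pixel rows
    StructField("width", IntegerType, nullable = false),     // assumed: number of pixel columns
    StructField("nChannels", IntegerType, nullable = false), // assumed: e.g. 1 (grayscale), 3 (BGR), 4 (BGRA)
    StructField("data", BinaryType, nullable = false)        // assumed: decompressed pixels in OpenCV row-major order
  ))
}

A loader built on such a schema would return a DataFrame whose rows follow this struct; the actual field list and reader interface are defined in the SPIP PDF linked above.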