Re: SPIP: Spark on Kubernetes

2017-08-18 Thread Sudarshan Kadambi
+1 (non-binding) 

We are evaluating Kubernetes for a variety of data processing workloads.
Spark is the natural choice for some of these workloads. Native Spark on
Kubernetes is of interest to us as it brings in dynamic allocation, resource
isolation and improved notions of security. 



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22197.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SPIP: Spark on Kubernetes

2017-08-18 Thread Matt Cheah
The problem with maintaining this scheduler separately right now is that the 
scheduler backend depends on the CoarseGrainedSchedulerBackend class, which is 
less a stable API than an internal class whose components currently need to be 
shared by all of the scheduler backends. As a result, maintaining this 
scheduler means maintaining not just a single small module but an entire fork 
of the project, so that the cluster-manager-specific scheduler backend can keep 
up with changes to CoarseGrainedSchedulerBackend. To avoid forking the entire 
project and provide the scheduler backend only as a pluggable module, we would 
need a fully pluggable scheduler backend with a stable API, as Erik mentioned. 
We also needed to change the spark-submit code to recognize Kubernetes mode and 
delegate to the Kubernetes submission client, so that would need to be 
pluggable as well.
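
As a rough illustration of what "pluggable with a stable API" could mean, here 
is a minimal sketch. It is written in Python only for brevity and is not Spark 
code: SchedulerBackend, BackendRegistry, ToyKubernetesBackend, the 
k8s://example-cluster:443 URL and the cap of 4 executors are all hypothetical 
names and values chosen for the example. The idea is that the core depends only 
on a small interface and picks an implementation by master-URL scheme, so a 
backend can live in its own module.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class SchedulerBackend(ABC):
    """Hypothetical stable interface that a cluster-manager plugin implements."""

    def __init__(self, master_url: str) -> None:
        self.master_url = master_url

    @abstractmethod
    def request_executors(self, total: int) -> int:
        """Ask the cluster manager for `total` executors; return how many were granted."""


class BackendRegistry:
    """Maps a master-URL scheme (the text before '://') to a backend class."""

    def __init__(self) -> None:
        self._backends: Dict[str, Type[SchedulerBackend]] = {}

    def register(self, scheme: str, backend_cls: Type[SchedulerBackend]) -> None:
        self._backends[scheme] = backend_cls

    def create(self, master_url: str) -> SchedulerBackend:
        # Split "k8s://host:port" into the scheme and the remainder.
        scheme, sep, _ = master_url.partition("://")
        if not sep or scheme not in self._backends:
            raise ValueError(f"no scheduler backend registered for {master_url!r}")
        return self._backends[scheme](master_url)


class ToyKubernetesBackend(SchedulerBackend):
    """Stand-in plugin living outside the core; it grants at most 4 executors."""

    def request_executors(self, total: int) -> int:
        return min(total, 4)


# The plugin registers itself; the core only ever talks to the registry.
registry = BackendRegistry()
registry.register("k8s", ToyKubernetesBackend)
backend = registry.create("k8s://example-cluster:443")
print(backend.request_executors(10))
```

In this sketch the submission path would need the same kind of lookup, which is 
the spark-submit point above: the code that recognizes the master URL has to be 
extensible too, not just the backend it hands off to.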

 

More discussion on fully pluggable scheduler backends is at 
https://issues.apache.org/jira/browse/SPARK-19700.

 

-Matt Cheah

 

From: Erik Erlandson 
Date: Friday, August 18, 2017 at 8:34 AM
To: "dev@spark.apache.org" 
Subject: Re: SPIP: Spark on Kubernetes

 

There are a fair number of people (myself included) who are interested in making 
scheduler back-ends fully pluggable.  That would have a significant impact on 
core Spark architecture, with corresponding risk. Adding the Kubernetes 
back-end in a manner similar to the other three back-ends has had a very small 
impact on Spark core, which allowed it to be developed in parallel and easily 
stay rebased on successive Spark releases while we were developing it and 
building up community support.

 

On Thu, Aug 17, 2017 at 7:14 PM, Mridul Muralidharan  wrote:

While I definitely support the idea of Apache Spark being able to
leverage Kubernetes, IMO it is better for the long-term evolution of Spark
to expose an appropriate SPI so that this support need not necessarily
live within the Apache Spark code base.
It will allow multiple backends to evolve, decoupled from Spark core.
In this case, it would have made maintaining the apache-spark-on-k8s repo
easier, just as it would allow for supporting other backends, both
open source (Nomad, for example) and proprietary.

In retrospect, directly integrating YARN support into Spark, while
mirroring Mesos support at that time, was probably an incorrect design
choice on my part.


Regards,
Mridul

On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan
 wrote:

> The Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend. We're
> ~6 months in and have had 5 releases.
>
> - 2 Spark versions maintained (2.1 and 2.2)
> - Extensive integration testing and refactoring efforts to maintain code
>   quality
> - Developer and user-facing documentation
> - 10+ consistent code contributors from different organizations involved in
>   actively maintaining and using the project, with several more members
>   involved in testing and providing feedback
> - The community has delivered several talks on Spark-on-Kubernetes,
>   generating lots of feedback from users
>
> In addition to these, we've seen efforts spawn off such as:
>
> - HDFS on Kubernetes with Locality and Performance Experiments
> - Kerberized access to HDFS from Spark running on Kubernetes
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
>
> Full Design Doc: link
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah, Ilan Filonenko, Sean Suchter, Kimoon Kim
>
> Background and Motivation
>
> Containerization and cluster management technologies are constantly evolving
> in the cluster computing world. Apache Spark currently implements support
> for Apache Hadoop YARN and Apache Mesos, in addition to providing its own
> standalone cluster manager. In 2014, Google announced development of
> Kubernetes which has its own unique feature set and differentiates itself
> from YARN and Mesos. Since its debut, it has seen contributions from over
> 1300 contributors with over 5 commits. Kubernetes has cemented itself as
> a core player in the cluster computing world, and cloud-computing providers
> such as Google Container Engine, Google Compute Engine, Amazon Web Services,
> and Microsoft Azure support running Kubernetes clusters.
>
>
> This document outlines a proposal for integrating Apache Spark with
> Kubernet

Re: SPIP: Spark on Kubernetes

2017-08-18 Thread varunkatta
+1 (non-binding)



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SPIP-Spark-on-Kubernetes-tp22147p22195.html



Re: SPIP: Spark on Kubernetes

2017-08-18 Thread Erik Erlandson
There are a fair number of people (myself included) who are interested in
making scheduler back-ends fully pluggable.  That would have a significant
impact on core Spark architecture, with corresponding risk.
Adding the Kubernetes back-end in a manner similar to the other three
back-ends has had a very small impact on Spark core, which allowed it to be
developed in parallel and easily stay rebased on successive Spark releases
while we were developing it and building up community support.

On Thu, Aug 17, 2017 at 7:14 PM, Mridul Muralidharan 
wrote:

> While I definitely support the idea of Apache Spark being able to
> leverage Kubernetes, IMO it is better for the long-term evolution of Spark
> to expose an appropriate SPI so that this support need not necessarily
> live within the Apache Spark code base.
> It will allow multiple backends to evolve, decoupled from Spark core.
> In this case, it would have made maintaining the apache-spark-on-k8s repo
> easier, just as it would allow for supporting other backends, both
> open source (Nomad, for example) and proprietary.
>
> In retrospect, directly integrating YARN support into Spark, while
> mirroring Mesos support at that time, was probably an incorrect design
> choice on my part.
>
>
> Regards,
> Mridul
>
> On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan
>  wrote:
> > The Spark on Kubernetes effort has been developed separately in a fork,
> > and linked back from the Apache Spark project as an experimental backend.
> > We're ~6 months in and have had 5 releases.
> >
> > - 2 Spark versions maintained (2.1 and 2.2)
> > - Extensive integration testing and refactoring efforts to maintain code
> >   quality
> > - Developer and user-facing documentation
> > - 10+ consistent code contributors from different organizations involved
> >   in actively maintaining and using the project, with several more members
> >   involved in testing and providing feedback
> > - The community has delivered several talks on Spark-on-Kubernetes,
> >   generating lots of feedback from users
> >
> > In addition to these, we've seen efforts spawn off such as:
> >
> > - HDFS on Kubernetes with Locality and Performance Experiments
> > - Kerberized access to HDFS from Spark running on Kubernetes
> >
> > Following the SPIP process, I'm putting this SPIP up for a vote.
> >
> > +1: Yeah, let's go forward and implement the SPIP.
> > +0: Don't really care.
> > -1: I don't think this is a good idea because of the following technical
> > reasons.
> >
> > If there is any further clarification desired, on the design or the
> > implementation, please feel free to ask questions or provide feedback.
> >
> >
> > SPIP: Kubernetes as A Native Cluster Manager
> >
> >
> > Full Design Doc: link
> >
> > JIRA: https://issues.apache.org/jira/browse/SPARK-18278
> >
> > Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
> >
> >
> > Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> > Cheah, Ilan Filonenko, Sean Suchter, Kimoon Kim
> >
> > Background and Motivation
> >
> > Containerization and cluster management technologies are constantly
> > evolving in the cluster computing world. Apache Spark currently implements
> > support for Apache Hadoop YARN and Apache Mesos, in addition to providing
> > its own standalone cluster manager. In 2014, Google announced development
> > of Kubernetes which has its own unique feature set and differentiates
> > itself from YARN and Mesos. Since its debut, it has seen contributions from
> > over 1300 contributors with over 5 commits. Kubernetes has cemented itself
> > as a core player in the cluster computing world, and cloud-computing
> > providers such as Google Container Engine, Google Compute Engine, Amazon
> > Web Services, and Microsoft Azure support running Kubernetes clusters.
> >
> >
> > This document outlines a proposal for integrating Apache Spark with
> > Kubernetes in a first class way, adding Kubernetes to the list of cluster
> > managers that Spark can be used with. Doing so would allow users to share
> > their computing resources and containerization framework between their
> > existing applications on Kubernetes and their computational Spark
> > applications. Although there is existing support for running a Spark
> > standalone cluster on Kubernetes, there are still major advantages and
> > significant interest in having native execution support. For example, this
> > integration provides better support for multi-tenancy and dynamic resource
> > allocation. It also allows users to run applications of different Spark
> > versions of their choices in the same cluster.
> >
> >
> > The feature is being developed in a separate fork in order to minimize risk
> > to the main project during development. Since the start of the development
> > in November of 2016, it has received over 100 commits from over 20
> > contributors and supports two releases based on Spark 2.1 and 2.2
> > respectively. Documentation is also being actively worked on