Re: SPIP: Spark on Kubernetes
+1 (non-binding)

We are evaluating Kubernetes for a variety of data processing workloads, and Spark is the natural choice for some of them. Native Spark on Kubernetes is of interest to us because it brings dynamic allocation, resource isolation, and improved security.
Re: SPIP: Spark on Kubernetes
The problem with maintaining this scheduler separately right now is that the scheduler backend depends on the CoarseGrainedSchedulerBackend class, which is not so much a stable API as an internal class with components that currently need to be shared by all of the scheduler backends. As a result, maintaining this scheduler requires maintaining not just a single small module but an entire fork of the project, so that the cluster-manager-specific scheduler backend can keep up with changes to CoarseGrainedSchedulerBackend.

If we wanted to avoid forking the entire project and provide the scheduler backend only as a pluggable module, we would need a fully pluggable scheduler backend with a stable API, as Erik mentioned. We also needed to change the spark-submit code to recognize Kubernetes mode and delegate to the Kubernetes submission client, so that would need to be pluggable as well. More discussion of fully pluggable scheduler backends is at https://issues.apache.org/jira/browse/SPARK-19700.

-Matt Cheah

From: Erik Erlandson
Date: Friday, August 18, 2017 at 8:34 AM
To: "dev@spark.apache.org"
Subject: Re: SPIP: Spark on Kubernetes

There are a fair number of people (myself included) who have an interest in making scheduler back-ends fully pluggable. That would have a significant impact on core Spark architecture, with corresponding risk. Adding the Kubernetes back-end in a manner similar to the other three back-ends has had a very small impact on Spark core, which allowed it to be developed in parallel and easily stay rebased on successive Spark releases while we were developing it and building up community support.
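For context on what "fully pluggable" would require: Spark already ships an ExternalClusterManager trait (a @DeveloperApi, discovered via Java's ServiceLoader), through which a cluster manager can hook into SparkContext. Below is a minimal sketch of a hypothetical Kubernetes plugin against that trait; the class name and package are illustrative, not the fork's actual code:

    package org.apache.spark.scheduler.cluster.k8s

    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend, TaskScheduler, TaskSchedulerImpl}
    import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend

    // Hypothetical plugin, registered via
    // META-INF/services/org.apache.spark.scheduler.ExternalClusterManager.
    private[spark] class KubernetesClusterManager extends ExternalClusterManager {

      // SparkContext hands the master URL to whichever registered manager
      // claims it, so "Kubernetes mode" becomes a URL scheme.
      override def canCreate(masterURL: String): Boolean =
        masterURL.startsWith("k8s://")

      override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
        new TaskSchedulerImpl(sc)

      override def createSchedulerBackend(
          sc: SparkContext,
          masterURL: String,
          scheduler: TaskScheduler): SchedulerBackend =
        // A real backend would subclass CoarseGrainedSchedulerBackend and
        // add pod allocation; shown bare here to keep the sketch short.
        new CoarseGrainedSchedulerBackend(
          scheduler.asInstanceOf[TaskSchedulerImpl], sc.env.rpcEnv)

      override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
        scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
    }

The catch is in createSchedulerBackend: any practical backend must extend CoarseGrainedSchedulerBackend to reuse executor registration and RPC plumbing, and that class carries no stability guarantees, which is exactly the gap the SPARK-19700 discussion is about.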
Re: SPIP: Spark on Kubernetes
+1 (non-binding)
Re: SPIP: Spark on Kubernetes
There are a fair number of people (myself included) who have an interest in making scheduler back-ends fully pluggable. That would have a significant impact on core Spark architecture, with corresponding risk. Adding the Kubernetes back-end in a manner similar to the other three back-ends has had a very small impact on Spark core, which allowed it to be developed in parallel and easily stay rebased on successive Spark releases while we were developing it and building up community support.

On Thu, Aug 17, 2017 at 7:14 PM, Mridul Muralidharan wrote:
> While I definitely support the idea of Apache Spark being able to
> leverage Kubernetes, IMO it is better for the long-term evolution of
> Spark to expose an appropriate SPI such that this support need not
> necessarily live within the Apache Spark code base. That would allow
> multiple backends to evolve, decoupled from Spark core. In this case, it
> would have made maintaining the apache-spark-on-k8s repo easier, just as
> it would allow for supporting other backends, both open source (Nomad,
> for example) and proprietary.
>
> In retrospect, directly integrating YARN support into Spark, while
> mirroring the Mesos support of that time, was probably an incorrect
> design choice on my part.
>
> Regards,
> Mridul
>
> On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan wrote:
> > The Spark on Kubernetes effort has been developed separately in a fork
> > and linked back from the Apache Spark project as an experimental
> > backend. We're ~6 months in and have had 5 releases:
> >
> > - 2 Spark versions maintained (2.1 and 2.2)
> > - Extensive integration testing and refactoring efforts to maintain
> >   code quality
> > - Developer and user-facing documentation
> > - 10+ consistent code contributors from different organizations
> >   involved in actively maintaining and using the project, with several
> >   more members involved in testing and providing feedback
> > - Several talks delivered by the community on Spark-on-Kubernetes,
> >   generating lots of feedback from users
> >
> > In addition to these, we've seen efforts spawn off such as:
> >
> > - HDFS on Kubernetes with Locality and Performance Experiments
> > - Kerberized access to HDFS from Spark running on Kubernetes
> >
> > Following the SPIP process, I'm putting this SPIP up for a vote.
> >
> > +1: Yeah, let's go forward and implement the SPIP.
> > +0: Don't really care.
> > -1: I don't think this is a good idea because of the following
> >     technical reasons.
> >
> > If there is any further clarification desired on the design or the
> > implementation, please feel free to ask questions or provide feedback.
> >
> > SPIP: Kubernetes as a Native Cluster Manager
> >
> > Full Design Doc: link
> > JIRA: https://issues.apache.org/jira/browse/SPARK-18278
> > Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
> >
> > Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash,
> > Matt Cheah, Ilan Filonenko, Sean Suchter, Kimoon Kim
> >
> > Background and Motivation
> >
> > Containerization and cluster management technologies are constantly
> > evolving in the cluster computing world. Apache Spark currently
> > implements support for Apache Hadoop YARN and Apache Mesos, in addition
> > to providing its own standalone cluster manager. In 2014, Google
> > announced the development of Kubernetes, which has its own unique
> > feature set and differentiates itself from YARN and Mesos. Since its
> > debut, it has seen contributions from over 1,300 contributors with over
> > 50,000 commits.
> > Kubernetes has cemented itself as a core player in the cluster
> > computing world, and cloud providers such as Google Container Engine,
> > Google Compute Engine, Amazon Web Services, and Microsoft Azure support
> > running Kubernetes clusters.
> >
> > This document outlines a proposal for integrating Apache Spark with
> > Kubernetes in a first-class way, adding Kubernetes to the list of
> > cluster managers that Spark can be used with. Doing so would allow
> > users to share their computing resources and containerization framework
> > between their existing applications on Kubernetes and their
> > computational Spark applications. Although there is existing support
> > for running a Spark standalone cluster on Kubernetes, there are still
> > major advantages to, and significant interest in, having native
> > execution support. For example, this integration provides better
> > support for multi-tenancy and dynamic resource allocation. It also
> > allows users to run applications of different Spark versions of their
> > choice in the same cluster.
> >
> > The feature is being developed in a separate fork in order to minimize
> > risk to the main project during development. Since the start of
> > development in November 2016, it has received over 100 commits from
> > over 20 contributors and supports two releases based on Spark 2.1 and
> > 2.2 respectively. Documentation is also being actively worked on.
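To make "native execution support" concrete: rather than running a standalone master inside the cluster, the driver's scheduler backend asks the Kubernetes API server for executor pods directly. That is what yields namespace-based multi-tenancy, per-application images (and hence mixed Spark versions in one cluster), and executor counts that can grow and shrink under dynamic allocation. Below is a rough sketch of that interaction using the fabric8 kubernetes-client, the client library the fork builds on; the pod name, label, image, and namespace are illustrative only, and the real backend adds allocation batching, owner references, and config-driven pod specs:

    import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}
    import io.fabric8.kubernetes.client.DefaultKubernetesClient

    object ExecutorPodSketch {
      def main(args: Array[String]): Unit = {
        // Picks up kubeconfig or in-cluster service-account credentials.
        val client = new DefaultKubernetesClient()
        try {
          // One executor = one pod; dynamic allocation grows or shrinks
          // the set of pods carrying this label.
          val executorPod: Pod = new PodBuilder()
            .withNewMetadata()
              .withName("spark-exec-1")              // illustrative name
              .addToLabels("spark-role", "executor") // lets the backend find its pods
            .endMetadata()
            .withNewSpec()
              .addNewContainer()
                .withName("executor")
                // The image pins the Spark version per application, so
                // different Spark versions can share one cluster.
                .withImage("spark-executor:2.2.0")   // illustrative image
              .endContainer()
              .withRestartPolicy("Never")
            .endSpec()
            .build()

          // The namespace is the multi-tenancy boundary the SPIP refers to.
          client.pods().inNamespace("spark-apps").create(executorPod)
        } finally {
          client.close()
        }
      }
    }

Tearing an executor down is then just a pod deletion, which is part of why dynamic allocation is a natural fit for this backend.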