There are a fair number of people (myself included) who are interested in making scheduler back-ends fully pluggable. That would have a significant impact on the core Spark architecture, with corresponding risk. Adding the Kubernetes back-end in a manner similar to the other three back-ends has had a very small impact on Spark core, which allowed it to be developed in parallel and to stay easily rebased on successive Spark releases while we were developing it and building up community support.
On Thu, Aug 17, 2017 at 7:14 PM, Mridul Muralidharan <mri...@gmail.com> wrote:

> While I definitely support the idea of Apache Spark being able to leverage Kubernetes, IMO it is better for the long-term evolution of Spark to expose an appropriate SPI such that this support need not necessarily live within the Apache Spark code base. It would allow multiple backends to evolve, decoupled from Spark core. In this case, it would have made maintaining the apache-spark-on-k8s repo easier, just as it would allow for supporting other backends, both open source (Nomad, for example) and proprietary.
>
> In retrospect, directly integrating YARN support into Spark, while mirroring the Mesos support at that time, was probably an incorrect design choice on my part.
>
> Regards,
> Mridul
>
> On Tue, Aug 15, 2017 at 8:32 AM, Anirudh Ramanathan <fox...@google.com.invalid> wrote:
>
> > The Spark on Kubernetes effort has been developed separately in a fork, and linked back from the Apache Spark project as an experimental backend. We're ~6 months in and have had 5 releases:
> >
> > - Two Spark versions maintained (2.1 and 2.2)
> > - Extensive integration testing and refactoring efforts to maintain code quality
> > - Developer- and user-facing documentation
> > - 10+ consistent code contributors from different organizations involved in actively maintaining and using the project, with several more members involved in testing and providing feedback
> > - Several talks delivered by the community on Spark on Kubernetes, generating lots of feedback from users
> >
> > In addition to these, we've seen efforts spawn off such as:
> >
> > - HDFS on Kubernetes with Locality and Performance Experiments
> > - Kerberized access to HDFS from Spark running on Kubernetes
> >
> > Following the SPIP process, I'm putting this SPIP up for a vote.
> >
> > +1: Yeah, let's go forward and implement the SPIP.
> > +0: Don't really care.
> > -1: I don't think this is a good idea because of the following technical reasons.
> >
> > If any further clarification is desired, on the design or the implementation, please feel free to ask questions or provide feedback.
> >
> > SPIP: Kubernetes as a Native Cluster Manager
> >
> > Full Design Doc: link
> > JIRA: https://issues.apache.org/jira/browse/SPARK-18278
> > Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
> >
> > Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt Cheah, Ilan Filonenko, Sean Suchter, Kimoon Kim
> >
> > Background and Motivation
> >
> > Containerization and cluster management technologies are constantly evolving in the cluster-computing world. Apache Spark currently implements support for Apache Hadoop YARN and Apache Mesos, in addition to providing its own standalone cluster manager. In 2014, Google announced development of Kubernetes, which has its own unique feature set and differentiates itself from YARN and Mesos. Since its debut, it has seen contributions from over 1,300 contributors with over 50,000 commits. Kubernetes has cemented itself as a core player in the cluster-computing world, and cloud-computing providers such as Google Container Engine, Google Compute Engine, Amazon Web Services, and Microsoft Azure support running Kubernetes clusters.
> >
> > This document outlines a proposal for integrating Apache Spark with Kubernetes in a first-class way, adding Kubernetes to the list of cluster managers that Spark can be used with.
> > Doing so would allow users to share their computing resources and containerization framework between their existing applications on Kubernetes and their computational Spark applications. Although there is existing support for running a Spark standalone cluster on Kubernetes, there are still major advantages and significant interest in having native execution support. For example, this integration provides better support for multi-tenancy and dynamic resource allocation. It also allows users to run applications of different Spark versions of their choice in the same cluster.
> >
> > The feature is being developed in a separate fork in order to minimize risk to the main project during development. Since the start of development in November 2016, it has received over 100 commits from over 20 contributors and supports two releases, based on Spark 2.1 and 2.2 respectively. Documentation is also being actively worked on, both in the main project repository and in the repository https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use cases, we have seen cluster setups that use 1000+ cores. We are also seeing growing interest in this project from more and more organizations.
> >
> > While it is easy to bootstrap the project in a forked repository, it is hard to maintain in the long run because of the tricky process of rebasing onto upstream and the lack of awareness in the wider Spark community. It would be beneficial to both the Spark and Kubernetes communities to see this feature merged upstream. On one hand, it gives Spark users the option of running their Spark workloads alongside other workloads that may already be running on Kubernetes, enabling better resource sharing and isolation and better cluster administration. On the other hand, it gives Kubernetes a leap forward in the area of large-scale data processing by being an officially supported cluster manager for Spark. The risk of merging into upstream is low because most of the changes are purely incremental, i.e., new Kubernetes-aware implementations of existing interfaces/classes in Spark core are introduced. The development is also concentrated in a single place, at resource-managers/kubernetes. The risk is further reduced by a comprehensive integration test framework and an active and responsive community of future maintainers.
> >
> > Target Personas
> >
> > DevOps, data scientists, data engineers, application developers, and anyone who can benefit from having Kubernetes as a native cluster manager for Spark.
> >
> > Goals
> >
> > - Make Kubernetes a first-class cluster manager for Spark, alongside Spark Standalone, YARN, and Mesos.
> > - Support both client and cluster deployment modes.
> > - Support dynamic resource allocation.
> > - Support Spark Java/Scala, PySpark, and SparkR applications.
> > - Support secure HDFS access.
> > - Allow running applications of different Spark versions in the same cluster through the ability to specify the driver and executor Docker images on a per-application basis (an illustrative configuration sketch follows the Non-Goals list below).
> > - Support specification and enforcement of limits on both CPU cores and memory.
> >
> > Non-Goals
> >
> > - Support cluster resource scheduling and sharing beyond the capabilities offered natively by the Kubernetes per-namespace resource quota model.
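For illustration only: the per-application image selection and CPU/memory limits described in the goals above surface to users as ordinary Spark configuration properties. The sketch below uses the property names documented by the apache-spark-on-k8s fork, which may change once the backend is upstreamed; the API server address, namespace, and image names are made-up placeholders, and in the fork these settings would normally be passed to spark-submit via --conf flags in cluster mode rather than set programmatically.

    import org.apache.spark.SparkConf

    // Illustrative configuration sketch; property names follow the fork's docs
    // and are not guaranteed to be the final upstream names.
    val conf = new SparkConf()
      .setMaster("k8s://https://k8s-apiserver.example.com:6443") // hypothetical API server
      .setAppName("spark-pi")
      .set("spark.kubernetes.namespace", "analytics")            // pods created in this namespace
      .set("spark.kubernetes.driver.docker.image", "myrepo/spark-driver:v2.2.0")
      .set("spark.kubernetes.executor.docker.image", "myrepo/spark-executor:v2.2.0")
      .set("spark.executor.instances", "5")
      .set("spark.executor.cores", "2")   // surfaced as a CPU limit on each executor pod
      .set("spark.executor.memory", "4g") // surfaced as a memory limit on each executor pod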
> > Proposed API Changes
> >
> > Most API changes are purely incremental, i.e., new Kubernetes-aware implementations of existing interfaces/classes in Spark core are introduced. Detailed changes are as follows:
> >
> > - A new cluster manager option, KUBERNETES, is introduced, and some changes are made to SparkSubmit to make it aware of this option.
> > - A new implementation of CoarseGrainedSchedulerBackend, namely KubernetesClusterSchedulerBackend, is responsible for managing the creation and deletion of executor Pods through the Kubernetes API.
> > - A new implementation of TaskSchedulerImpl, namely KubernetesTaskSchedulerImpl, and a new implementation of TaskSetManager, namely KubernetesTaskSetManager, are introduced for Kubernetes-aware task scheduling.
> > - When dynamic resource allocation is enabled, a new implementation of ExternalShuffleService, namely KubernetesExternalShuffleService, is introduced.
> >
> > Design Sketch
> >
> > Below we briefly describe the design. For more details on the design and architecture, please refer to the architecture documentation. The main idea of this design is to run the Spark driver and executors inside Kubernetes Pods. Pods are a co-located and co-scheduled group of one or more containers run in a shared context. The driver is responsible for creating and destroying executor Pods through the Kubernetes API, while Kubernetes is fully responsible for scheduling the Pods onto available nodes in the cluster. In cluster mode, the driver also runs in a Pod in the cluster, created through the Kubernetes API by a Kubernetes-aware submission client called by the spark-submit script. Because the driver runs in a Pod, it is reachable by the executors in the cluster using its Pod IP. In client mode, the driver runs outside the cluster and calls the Kubernetes API to create and destroy executor Pods; the driver must be routable from within the cluster for the executors to communicate with it.
> >
> > The main component running in the driver is the KubernetesClusterSchedulerBackend, an implementation of CoarseGrainedSchedulerBackend, which manages allocating and destroying executors via the Kubernetes API, as instructed by Spark core via calls to the methods doRequestTotalExecutors and doKillExecutors, respectively. Within the KubernetesClusterSchedulerBackend, a separate kubernetes-pod-allocator thread handles the creation of new executor Pods with appropriate throttling and monitoring. Throttling is achieved using a feedback loop that decides whether to submit new requests for executors based on whether previous executor Pod creation requests have completed. This indirection is necessary because the Kubernetes API server accepts requests for new Pods optimistically, with the anticipation of being able to eventually schedule them to run. However, it is undesirable to have a very large number of Pods that cannot be scheduled and stay pending within the cluster. The throttling mechanism gives us control over how fast an application scales up (which can be configured), and helps prevent Spark applications from DoS-ing the Kubernetes API server with too many Pod creation requests. The executor Pods simply run the CoarseGrainedExecutorBackend class from a pre-built Docker image that contains a Spark distribution.
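To make the shape of the new backend concrete, here is a minimal Scala sketch, not the fork's actual code: it extends CoarseGrainedSchedulerBackend using the hook-method signatures as they exist in Spark 2.2 and uses the fabric8 Kubernetes client for pod creation and deletion. The package location, constructor, pod-naming scheme, and configuration keys are assumptions; the pod-allocator thread and its throttling loop are omitted.

    package org.apache.spark.scheduler.cluster.k8s // hypothetical package placement

    import scala.concurrent.Future

    import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}
    import io.fabric8.kubernetes.client.KubernetesClient

    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.TaskSchedulerImpl
    import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend

    // Sketch only: mirrors the structure described above, not the fork's implementation.
    private[spark] class KubernetesClusterSchedulerBackend(
        scheduler: TaskSchedulerImpl,
        sc: SparkContext,
        kubernetesClient: KubernetesClient)
      extends CoarseGrainedSchedulerBackend(scheduler, sc.env.rpcEnv) {

      private val namespace = conf.get("spark.kubernetes.namespace", "default")
      private val executorImage = conf.get("spark.kubernetes.executor.docker.image")

      // Desired executor count, as most recently requested by Spark core.
      @volatile private var targetExecutors = 0

      // Called by Spark core (e.g. under dynamic allocation) when the desired number
      // of executors changes. The kubernetes-pod-allocator thread (not shown) reads
      // this target and creates executor pods in throttled batches.
      override def doRequestTotalExecutors(requestedTotal: Int): Future[Boolean] = {
        targetExecutors = requestedTotal
        Future.successful(true)
      }

      // Called by Spark core to remove specific executors; each executor maps to a pod.
      override def doKillExecutors(executorIds: Seq[String]): Future[Boolean] = {
        executorIds.foreach { id =>
          kubernetesClient.pods()
            .inNamespace(namespace)
            .withName(executorPodName(id))
            .delete()
        }
        Future.successful(true)
      }

      // Hypothetical executor pod naming scheme.
      private def executorPodName(executorId: String): String =
        s"${applicationId()}-exec-$executorId"

      // Builds an executor pod that runs CoarseGrainedExecutorBackend from a pre-built
      // Spark image; it would be submitted by the pod-allocator thread.
      private def buildExecutorPod(executorId: String): Pod =
        new PodBuilder()
          .withNewMetadata()
            .withName(executorPodName(executorId))
            .addToLabels("spark-app-id", applicationId())
          .endMetadata()
          .withNewSpec()
            .addNewContainer()
              .withName("spark-executor")
              .withImage(executorImage)
            .endContainer()
          .endSpec()
          .build()
    }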
> > There are two auxiliary and optional components, the ResourceStagingServer and the KubernetesExternalShuffleService, which serve the specific purposes described below. The ResourceStagingServer serves as a file store (in the absence of a persistent storage layer in Kubernetes) for application dependencies uploaded from the submission client machine, which then get downloaded from the server by the init-containers in the driver and executor Pods. It is a Jetty server with JAX-RS and has two endpoints, for uploading and downloading files respectively. Security tokens are returned in the responses for file uploads and must be carried in the requests for downloading the files. The ResourceStagingServer is deployed as a Kubernetes Service backed by a Deployment in the cluster, and multiple instances may be deployed in the same cluster. Spark applications specify which ResourceStagingServer instance to use through a configuration property.
> >
> > The KubernetesExternalShuffleService is used to support dynamic resource allocation, with which the number of executors of a Spark application can change at runtime based on its resource needs. It provides an additional endpoint for drivers that allows the shuffle service to detect driver termination and clean up the shuffle files associated with the corresponding application. There are two ways of deploying the KubernetesExternalShuffleService: running a shuffle service Pod on each node in the cluster, or on a subset of the nodes, using a DaemonSet; or running a shuffle service container in each of the executor Pods. In the first option, each shuffle service container mounts a hostPath volume. The same hostPath volume is also mounted by each of the executor containers, which must also have the environment variable SPARK_LOCAL_DIRS point to the hostPath. In the second option, a shuffle service container is co-located with an executor container in each of the executor Pods, and the two containers share an emptyDir volume where the shuffle data gets written (an illustrative pod sketch appears at the end of this design section). There may be multiple instances of the shuffle service deployed in a cluster, which may be used for different versions of Spark, or for different priority levels with different resource quotas.
> >
> > New Kubernetes-specific configuration options are also introduced to facilitate specification and customization of driver and executor Pods and related Kubernetes resources. For example, driver and executor Pods can be created in a particular Kubernetes namespace and on a particular set of nodes in the cluster. Users are allowed to apply labels and annotations to the driver and executor Pods.
> >
> > Additionally, secure HDFS support is being actively worked on, following the design here. Both short-running jobs and long-running jobs that need periodic delegation token refresh are supported, leveraging built-in Kubernetes constructs like Secrets. Please refer to the design doc for details.
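Returning to the second shuffle-service deployment option described above, the following is an illustrative Scala sketch using the fabric8 Kubernetes model builders: an executor pod carries a shuffle-service sidecar, both containers mount a shared emptyDir volume, and the executor's SPARK_LOCAL_DIRS points at that mount so shuffle files land where the sidecar can serve them. Container names, images, and the mount path are placeholders, not taken from the fork.

    import io.fabric8.kubernetes.api.model.{Pod, PodBuilder}

    // Sketch of deployment option two: executor plus shuffle-service sidecar
    // sharing an emptyDir volume. All names and paths are illustrative.
    object ShuffleSidecarPodSketch {
      def executorPodWithShuffleSidecar(
          executorName: String,
          executorImage: String,
          shuffleServiceImage: String): Pod = {
        val localDirs = "/var/data/spark-local" // assumed shuffle directory
        new PodBuilder()
          .withNewMetadata()
            .withName(executorName)
          .endMetadata()
          .withNewSpec()
            // Shared scratch space whose lifecycle is managed by Kubernetes.
            .addNewVolume()
              .withName("spark-local-dirs")
              .withNewEmptyDir().endEmptyDir()
            .endVolume()
            // Executor container: writes shuffle data under SPARK_LOCAL_DIRS.
            .addNewContainer()
              .withName("spark-executor")
              .withImage(executorImage)
              .addNewEnv().withName("SPARK_LOCAL_DIRS").withValue(localDirs).endEnv()
              .addNewVolumeMount().withName("spark-local-dirs").withMountPath(localDirs).endVolumeMount()
            .endContainer()
            // Shuffle-service sidecar: serves those files to other executors.
            .addNewContainer()
              .withName("shuffle-service")
              .withImage(shuffleServiceImage)
              .addNewVolumeMount().withName("spark-local-dirs").withMountPath(localDirs).endVolumeMount()
            .endContainer()
          .endSpec()
          .build()
      }
    }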
> > Rejected Designs
> >
> > Resource Staging by the Driver
> >
> > A first implementation effectively included the ResourceStagingServer in the driver container itself. The driver container ran a custom command that opened an HTTP endpoint and waited for the submission client to send resources to it. The server would then run the driver code after it had received the resources from the submission client machine. The problem with this approach is that the submission client needs to deploy the driver in such a way that the driver itself is reachable from outside of the cluster, but it is difficult for an automated framework that is not aware of the cluster's configuration to expose an arbitrary pod in a generic way. The Service-based design that was chosen allows a cluster administrator to expose the ResourceStagingServer in a manner that makes sense for their cluster, such as with an Ingress or with a NodePort Service.
> >
> > Kubernetes External Shuffle Service
> >
> > Several alternatives were considered for the design of the shuffle service. The first design postulated the use of long-lived executor Pods with sidecar containers running the shuffle service. The advantage of this model was that it would let us use emptyDir for sharing, as opposed to using node-local storage, which guarantees better lifecycle management of storage by Kubernetes. The apparent disadvantage was that it would be a departure from the traditional Spark methodology of keeping executors only for as long as required in dynamic allocation mode. It would additionally use up more resources than strictly necessary during the course of long-running jobs, partially losing the advantage of dynamic scaling.
> >
> > Another alternative considered was to use a separate shuffle service manager as a nameserver. This design has a few drawbacks. First, it means another component that needs authentication/authorization management and maintenance. Second, this separate component needs to be kept in sync with the Kubernetes cluster. Last but not least, most of the functionality of this separate component can be performed by a combination of the in-cluster shuffle service and the Kubernetes API server.
> >
> > Pluggable Scheduler Backends
> >
> > Fully pluggable scheduler backends were considered as a more generalized solution, and remain interesting as a possible avenue for future-proofing against new scheduling targets. For the purposes of this project, adding a new specialized scheduler backend for Kubernetes was chosen as the approach due to its very low impact on the core Spark code; making the scheduler fully pluggable would be a high-impact, high-risk modification to Spark's core libraries. The pluggable scheduler backends effort is being tracked in SPARK-19700.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org