In addition to the engineering & software aspects of the native Kubernetes
community project, we have also worked on building out the community, with
the goal of providing the foundation for sustaining engineering on the
Kubernetes scheduler back-end.  That said, I agree 100% with your point
that adding committers with kube-specific experience is a good strategy for
increasing review bandwidth to help service PRs from this community.

On Mon, Aug 28, 2017 at 2:16 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> In my opinion, the fact that there are nearly no changes to spark-core,
>> and most of our changes are additive should go to prove that this adds
>> little complexity to the workflow of the committers.
>
>
> Actually (and somewhat perversely), the otherwise praiseworthy isolation
> of the Kubernetes code does mean that it adds complexity to the workflow of
> the existing Spark committers. I'll reiterate Imran's concerns: The
> existing Spark committers familiar with Spark's scheduler code have
> adequate knowledge of the Standalone and Yarn implementations, but still
> not sufficient coverage of Mesos. Adding k8s code to Spark would mean that
> the k8s code would start seeing the same issues that the Mesos
> code in Spark currently sees: Reviews and commits tend to languish because
> we don't have currently active committers with sufficient knowledge and
> cycles to deal with the Mesos PRs. Some of this is because the PMC needs to
> get back to addressing the issue of adding new Spark committers who do have
> the needed Mesos skills, but that isn't as simple as we'd like because
> ideally a Spark committer has demonstrated skills across a significant
> portion of the Spark code, not just tightly focused on one area (such as
> Mesos or k8s integration). In short, adding Kubernetes support directly
> into Spark isn't likely (at least in the short term) to be entirely
> positive for the spark-on-k8s project, since merging of PRs to the
> spark-on-k8s code is very likely to be quite slow at least until such time as we
> have k8s-focused Spark committers. If this project does end up getting
> pulled into the Spark codebase, then the PMC will need to start looking at
> bringing in one or more new committers who meet our requirements for such a
> role and responsibility, and who also have k8s skills. The success and pace
> of development of the spark-on-k8s project will depend in large measure on the
> PMC's ability to find such new committers.
>
> All that said, I'm +1 if those currently responsible for the
> spark-on-k8s project still want to bring the code into Spark.
>
>
> On Mon, Aug 21, 2017 at 11:48 AM, Anirudh Ramanathan <
> ramanath...@google.com.invalid> wrote:
>
>> Thank you for your comments Imran.
>>
>> Regarding integration tests,
>>
>> What you inferred from the documentation is correct -
>> Integration tests do not require any prior setup or a Kubernetes cluster
>> to run. Minikube is a single binary that brings up a one-node cluster and
>> exposes the full Kubernetes API. It is actively maintained and kept up to
>> date with the rest of the project. These local integration tests on Jenkins
>> (like the ones for spark-on-yarn) should allow committers to
>> merge changes with a high degree of confidence.
>> I will update the proposal to include more information about the extent
>> and kinds of testing we do.
>>
>> As for (b), people on this thread and the set of contributors on our fork
>> are a fairly wide community of contributors and committers who would be
>> involved in the maintenance long-term. It was one of the reasons behind
>> developing separately as a fork. In my opinion, the fact that there are
>> nearly no changes to spark-core, and most of our changes are additive
>> should go to prove that this adds little complexity to the workflow of the
>> committers.
>>
>> Separating out the cluster managers (into an as yet undecided new home)
>> appears far more disruptive and a higher-risk change for the short term.
>> However, if enough community support builds behind that effort (tracked
>> in SPARK-19700 <https://issues.apache.org/jira/browse/SPARK-19700>) and it
>> is realized in the future, it wouldn't be difficult to switch
>> Kubernetes, YARN, and Mesos over to using the pluggable API. Currently, in my
>> opinion, with the integration tests, active users, and a community of
>> maintainers, Spark-on-Kubernetes would add minimal overhead and benefit a
>> large (and growing) class of users.
>>
>> Lastly, the RSS is indeed separate and a value-add that we would love to
>> share with other cluster managers as well.
>>
>> On Mon, Aug 21, 2017 at 10:17 AM, Imran Rashid <iras...@cloudera.com>
>> wrote:
>>
>>> Overall this looks like a good proposal.  I do have some concerns which
>>> I'd like to discuss -- please understand I'm taking a "devil's advocate"
>>> stance here for discussion, not that I'm giving a -1.
>>>
>>> My primary concern is about testing and maintenance.  My concerns might
>>> be addressed if the doc included a section on testing that might just be
>>> this: https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2
>>> -kubernetes/resource-managers/kubernetes/README.md#running-
>>> the-kubernetes-integration-tests
>>>
>>> but without the concerning warning "Note that the integration test
>>> framework is currently being heavily revised and is subject to change".
>>> I'd like the proposal to clearly indicate that some baseline testing can be
>>> done by devs and in spark's regular jenkins builds without special access
>>> to kubernetes clusters.
>>>
>>> It's worth noting that there *are* advantages to keeping it outside Spark:
>>> * when making changes to spark's scheduler, we do *not* have to worry
>>> about how those changes impact kubernetes.  This simplifies things for
>>> those making changes to spark
>>> * making changes to the kubernetes integration is not blocked by
>>> getting enough attention from spark's committers
>>>
>>> Or, in other words, each community of experts can maintain its focus.  I
>>> have these concerns based on past experience with the mesos integration --
>>> mesos contributors are blocked on committers reviewing their changes, and
>>> then committers have no idea how to test that the changes are correct, and
>>> find it hard to even learn the ins and outs of that code without access to
>>> a mesos cluster.
>>>
>>> The same could be said for the yarn integration, but I think it's helped
>>> that (a) spark-on-yarn *does* have local tests for testing basic
>>> integration and (b) there is a sufficient community of contributors and
>>> committers for spark-on-yarn.   I realize (b) is a chicken-and-egg problem,
>>> but I'd like to be sure that at least (a) is addressed.  (And maybe even
>>> spark-on-yarn shouldn't be inside spark itself, as mridul said, but it's
>>> not clear what the other home should be.)
>>>
>>> At some point, this is just a judgement call about the value it brings to
>>> the spark community vs. the added complexity.  I'm willing to believe that
>>> kubernetes will bring enough value to make this worthwhile; I'm just voicing my
>>> concerns.
>>>
>>> Secondary concern:
>>> the RSS (ResourceStagingServer) doesn't seem necessary for kubernetes
>>> support, or specific to it.  If it's nice to have, and you want to add it to
>>> kubernetes before other cluster managers, fine, but it seems separate from
>>> this proposal.
>>>
>>>
>>>
>>> On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <
>>> fox...@google.com.invalid> wrote:
>>>
>>>> The Spark on Kubernetes effort has been developed separately in a fork, and
>>>> linked back from the Apache Spark project as an experimental backend
>>>> <http://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types>.
>>>> We're ~6 months in and have had 5 releases
>>>> <https://github.com/apache-spark-on-k8s/spark/releases>.
>>>>
>>>>    - 2 Spark versions maintained (2.1, and 2.2)
>>>>    - Extensive integration testing and refactoring efforts to maintain
>>>>    code quality
>>>>    - Developer
>>>>    <https://github.com/apache-spark-on-k8s/spark#getting-started> and
>>>>    user-facing <https://apache-spark-on-k8s.github.io/userdocs/>
>>>>    documentation
>>>>    - 10+ consistent code contributors from different organizations
>>>>    <https://apache-spark-on-k8s.github.io/userdocs/contribute.html#project-contributions>
>>>>    involved in actively maintaining and using the project, with several more
>>>>    members involved in testing and providing feedback.
>>>>    - The community has delivered several talks on Spark-on-Kubernetes
>>>>    generating lots of feedback from users.
>>>>    - In addition to these, we've seen efforts spawn off such as:
>>>>       - HDFS on Kubernetes
>>>>         <https://github.com/apache-spark-on-k8s/kubernetes-HDFS> with
>>>>         Locality and Performance Experiments
>>>>       - Kerberized access
>>>>         <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit>
>>>>         to HDFS from Spark running on Kubernetes
>>>>
>>>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>>>
>>>>    - +1: Yeah, let's go forward and implement the SPIP.
>>>>    - +0: Don't really care.
>>>>    - -1: I don't think this is a good idea because of the following
>>>>    technical reasons.
>>>>
>>>> If there is any further clarification desired, on the design or the
>>>> implementation, please feel free to ask questions or provide feedback.
>>>>
>>>>
>>>> SPIP: Kubernetes as A Native Cluster Manager
>>>>
>>>> Full Design Doc: link
>>>> <https://issues.apache.org/jira/secure/attachment/12881586/SPARK-18278%20Spark%20on%20Kubernetes%20Design%20Proposal%20Revision%202%20%281%29.pdf>
>>>>
>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>>>
>>>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>>>
>>>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
>>>> Cheah, Ilan Filonenko, Sean Suchter, Kimoon Kim
>>>>
>>>> Background and Motivation
>>>>
>>>> Containerization and cluster management technologies are constantly
>>>> evolving in the cluster computing world. Apache Spark currently implements
>>>> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
>>>> its own standalone cluster manager. In 2014, Google announced development
>>>> of Kubernetes <https://kubernetes.io/>, which has its own unique
>>>> feature set and differentiates itself from YARN and Mesos. Since its debut,
>>>> it has seen contributions from over 1300 contributors with over 50000
>>>> commits. Kubernetes has cemented itself as a core player in the cluster
>>>> computing world, and cloud-computing providers such as Google Container
>>>> Engine, Google Compute Engine, Amazon Web Services, and Microsoft Azure
>>>> support running Kubernetes clusters.
>>>>
>>>> This document outlines a proposal for integrating Apache Spark with
>>>> Kubernetes in a first class way, adding Kubernetes to the list of cluster
>>>> managers that Spark can be used with. Doing so would allow users to share
>>>> their computing resources and containerization framework between their
>>>> existing applications on Kubernetes and their computational Spark
>>>> applications. Although there is existing support for running a Spark
>>>> standalone cluster on Kubernetes
>>>> <https://github.com/kubernetes/examples/blob/master/staging/spark/README.md>,
>>>> there are still major advantages and significant interest in having native
>>>> execution support. For example, this integration provides better support
>>>> for multi-tenancy and dynamic resource allocation. It also allows users to
>>>> run applications using different Spark versions of their choice in the same
>>>> cluster.
>>>>
>>>> The feature is being developed in a separate fork
>>>> <https://github.com/apache-spark-on-k8s/spark> in order to minimize
>>>> risk to the main project during development. Since the start of the
>>>> development in November of 2016, it has received over 100 commits from over
>>>> 20 contributors and supports two releases based on Spark 2.1 and 2.2
>>>> respectively. Documentation is also being actively worked on, both in the
>>>> main project repository and in the repository at
>>>> https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world
>>>> use cases, we have seen cluster setups that use 1000+ cores. We are also
>>>> seeing growing interest in this project from more and more organizations.
>>>>
>>>> While it is easy to bootstrap the project in a forked repository, it is
>>>> hard to maintain it in the long run because of the tricky process of
>>>> rebasing onto upstream and the lack of awareness in the larger Spark
>>>> community. It would be beneficial to both the Spark and Kubernetes
>>>> communities to see this feature merged upstream. On one hand, it gives
>>>> Spark users the option of running their Spark workloads along with other
>>>> workloads that may already be running on Kubernetes, enabling better
>>>> resource sharing and isolation, and better cluster administration. On the
>>>> other hand, it gives Kubernetes a leap forward in the area of large-scale
>>>> data processing by being an officially supported cluster manager for Spark.
>>>> The risk of merging into upstream is low because most of the changes are
>>>> purely incremental, i.e., new Kubernetes-aware implementations of existing
>>>> interfaces/classes in Spark core are introduced. The development is also
>>>> concentrated in a single place at resource-managers/kubernetes
>>>> <https://github.com/apache-spark-on-k8s/spark/tree/branch-2.2-kubernetes/resource-managers/kubernetes>.
>>>> The risk is further reduced by a comprehensive integration test framework,
>>>> and an active and responsive community of future maintainers.
>>>> Target Personas
>>>>
>>>> Devops, data scientists, data engineers, application developers, and anyone
>>>> who can benefit from having Kubernetes
>>>> <https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/> as
>>>> a native cluster manager for Spark.
>>>> Goals
>>>>
>>>>    - Make Kubernetes a first-class cluster manager for Spark, alongside
>>>>      Spark Standalone, Yarn, and Mesos.
>>>>    - Support both client and cluster deployment mode.
>>>>    - Support dynamic resource allocation
>>>>      <http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation>.
>>>>    - Support Spark Java/Scala, PySpark, and Spark R applications.
>>>>    - Support secure HDFS access.
>>>>    - Allow running applications of different Spark versions in the same
>>>>      cluster through the ability to specify the driver and executor Docker
>>>>      images on a per-application basis.
>>>>    - Support specification and enforcement of limits on both CPU cores
>>>>      and memory.
>>>>
>>>> Non-Goals
>>>>
>>>>    - Support cluster resource scheduling and sharing beyond capabilities
>>>>      offered natively by the Kubernetes per-namespace resource quota model.
>>>>
>>>> Proposed API Changes
>>>>
>>>> Most API changes are purely incremental, i.e., new Kubernetes-aware
>>>> implementations of existing interfaces/classes in Spark core are
>>>> introduced. Detailed changes are as follows.
>>>>
>>>>    - A new cluster manager option, KUBERNETES, is introduced, and some
>>>>      changes are made to SparkSubmit to make it aware of this option (a
>>>>      sketch of this kind of change follows the list below).
>>>>    - A new implementation of CoarseGrainedSchedulerBackend, namely
>>>>      KubernetesClusterSchedulerBackend, is responsible for managing the
>>>>      creation and deletion of executor Pods through the Kubernetes API.
>>>>    - A new implementation of TaskSchedulerImpl, namely
>>>>      KubernetesTaskSchedulerImpl, and a new implementation of
>>>>      TaskSetManager, namely KubernetesTaskSetManager, are introduced
>>>>      for Kubernetes-aware task scheduling.
>>>>    - When dynamic resource allocation is enabled, a new implementation
>>>>      of ExternalShuffleService, namely KubernetesExternalShuffleService,
>>>>      is introduced.
>>>>
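>>>> The following is a minimal, self-contained Scala sketch of the kind of
>>>> change the first bullet describes: a new cluster-manager constant plus
>>>> detection of a Kubernetes master URL prefix. It is illustrative only and
>>>> not the actual SparkSubmit code; the k8s:// prefix follows the fork's
>>>> documentation, while the object and constant names here are assumptions.
>>>>
>>>>   // Hypothetical stand-in for SparkSubmit's cluster-manager detection.
>>>>   object ClusterManagers {
>>>>     val YARN = 1; val STANDALONE = 2; val MESOS = 4; val LOCAL = 8
>>>>     val KUBERNETES = 16  // the new option introduced by this proposal
>>>>
>>>>     // Pick a cluster manager from the --master URL.
>>>>     def fromMaster(master: String): Int = master match {
>>>>       case m if m.startsWith("yarn")     => YARN
>>>>       case m if m.startsWith("spark://") => STANDALONE
>>>>       case m if m.startsWith("mesos://") => MESOS
>>>>       case m if m.startsWith("k8s://")   => KUBERNETES
>>>>       case m if m.startsWith("local")    => LOCAL
>>>>       case other => throw new IllegalArgumentException(s"Unknown master URL: $other")
>>>>     }
>>>>   }
>>>>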
>>>> Design Sketch
>>>>
>>>> Below we briefly describe the design. For more details on the design
>>>> and architecture, please refer to the architecture documentation
>>>> <https://github.com/apache-spark-on-k8s/spark/tree/branch-2.2-kubernetes/resource-managers/kubernetes/architecture-docs>.
>>>> The main idea of this design is to run Spark driver and executors inside
>>>> Kubernetes Pods
>>>> <https://kubernetes.io/docs/concepts/workloads/pods/pod/>. A Pod is a
>>>> co-located and co-scheduled group of one or more containers that run in a
>>>> shared context. The driver is responsible for creating and destroying executor
>>>> Pods through the Kubernetes API, while Kubernetes is fully responsible for
>>>> scheduling the Pods to run on available nodes in the cluster. In
>>>> cluster mode, the driver also runs in a Pod in the cluster, created through
>>>> the Kubernetes API by a Kubernetes-aware submission client called by the
>>>> spark-submit script. Because the driver runs in a Pod, it is reachable
>>>> by the executors in the cluster using its Pod IP. In client mode, the
>>>> driver runs outside the cluster and calls the Kubernetes API to create and
>>>> destroy executor Pods. The driver must be routable from within the cluster
>>>> for the executors to communicate with it.
>>>>
>>>> The main component running in the driver is the
>>>> KubernetesClusterSchedulerBackend, an implementation of
>>>> CoarseGrainedSchedulerBackend, which manages allocating and destroying
>>>> executors via the Kubernetes API, as instructed by Spark core via calls to
>>>> methods doRequestTotalExecutors and doKillExecutors, respectively.
>>>> Within the KubernetesClusterSchedulerBackend, a separate
>>>> kubernetes-pod-allocator thread handles the creation of new executor
>>>> Pods with appropriate throttling and monitoring. Throttling is achieved
>>>> using a feedback loop that makes decisions on submitting new requests for
>>>> executors based on whether previous executor Pod creation requests have
>>>> completed. This indirection is necessary because the Kubernetes API server
>>>> accepts requests for new Pods optimistically, with the anticipation of
>>>> being able to eventually schedule them to run. However, it is undesirable
>>>> to have a very large number of Pods that cannot be scheduled and stay
>>>> pending within the cluster. The throttling mechanism gives us control over
>>>> how fast an application scales up (which can be configured), and helps
>>>> prevent Spark applications from DOS-ing the Kubernetes API server with too
>>>> many Pod creation requests. The executor Pods simply run the
>>>> CoarseGrainedExecutorBackend class from a pre-built Docker image that
>>>> contains a Spark distribution.
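>>>>
>>>> As a rough illustration of the throttling loop described above, the sketch
>>>> below models the feedback mechanism in plain Scala. It is not the fork's
>>>> actual allocator code: KubernetesLikeClient is a stand-in for the real
>>>> Kubernetes client, and the batch-size and interval parameters are
>>>> illustrative names.
>>>>
>>>>   import java.util.concurrent.{Executors, TimeUnit}
>>>>   import java.util.concurrent.atomic.AtomicInteger
>>>>
>>>>   // Stand-in for the subset of the Kubernetes API the allocator needs.
>>>>   trait KubernetesLikeClient {
>>>>     def createExecutorPod(): Unit         // submit one Pod creation request
>>>>     def countPendingExecutorPods(): Int   // accepted but not yet scheduled
>>>>     def countRunningExecutorPods(): Int   // Pods actually running
>>>>   }
>>>>
>>>>   class PodAllocator(client: KubernetesLikeClient,
>>>>                      allocationBatchSize: Int,
>>>>                      allocationIntervalMs: Long) {
>>>>     private val targetTotal = new AtomicInteger(0)
>>>>     private val timer = Executors.newSingleThreadScheduledExecutor()
>>>>
>>>>     // Called from the scheduler backend's doRequestTotalExecutors hook.
>>>>     def setTargetTotalExecutors(n: Int): Unit = targetTotal.set(n)
>>>>
>>>>     def start(): Unit = timer.scheduleWithFixedDelay(new Runnable {
>>>>       override def run(): Unit = {
>>>>         // Feedback loop: request more Pods only once previously requested
>>>>         // Pods have left the pending state, one bounded batch at a time.
>>>>         val pending = client.countPendingExecutorPods()
>>>>         val running = client.countRunningExecutorPods()
>>>>         val stillWanted = targetTotal.get() - pending - running
>>>>         if (pending == 0 && stillWanted > 0) {
>>>>           val batch = math.min(allocationBatchSize, stillWanted)
>>>>           (1 to batch).foreach(_ => client.createExecutorPod())
>>>>         }
>>>>       }
>>>>     }, 0L, allocationIntervalMs, TimeUnit.MILLISECONDS)
>>>>
>>>>     def stop(): Unit = timer.shutdownNow()
>>>>   }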
>>>>
>>>> There are auxiliary and optional components: ResourceStagingServer and
>>>> KubernetesExternalShuffleService, which serve specific purposes
>>>> described below. The ResourceStagingServer serves as a file store (in
>>>> the absence of a persistent storage layer in Kubernetes) for application
>>>> dependencies uploaded from the submission client machine, which then get
>>>> downloaded from the server by the init-containers in the driver and
>>>> executor Pods. It is a Jetty server with JAX-RS and has two endpoints for
>>>> uploading and downloading files, respectively. Security tokens are returned
>>>> in the responses for file uploading and must be carried in the requests for
>>>> downloading the files. The ResourceStagingServer is deployed as a
>>>> Kubernetes Service
>>>> <https://kubernetes.io/docs/concepts/services-networking/service/>
>>>> backed by a Deployment
>>>> <https://kubernetes.io/docs/concepts/workloads/controllers/deployment/>
>>>> in the cluster and multiple instances may be deployed in the same cluster.
>>>> Spark applications specify which ResourceStagingServer instance to use
>>>> through a configuration property.
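>>>>
>>>> To make the upload/download-with-token flow concrete, here is a rough
>>>> JAX-RS sketch in Scala. The endpoint paths, the header name, and the
>>>> in-memory store are assumptions for illustration; the actual
>>>> ResourceStagingServer has its own API and persists files rather than
>>>> buffering them in memory.
>>>>
>>>>   import java.io.InputStream
>>>>   import java.util.UUID
>>>>   import java.util.concurrent.ConcurrentHashMap
>>>>   import javax.ws.rs.{Consumes, GET, HeaderParam, POST, Path, PathParam, Produces}
>>>>   import javax.ws.rs.core.{MediaType, Response}
>>>>
>>>>   @Path("/resources")
>>>>   class StagingResource {
>>>>     // resource id -> (bytes, secret token required for download)
>>>>     private val store = new ConcurrentHashMap[String, (Array[Byte], String)]()
>>>>
>>>>     @POST
>>>>     @Consumes(Array(MediaType.APPLICATION_OCTET_STREAM))
>>>>     @Produces(Array(MediaType.TEXT_PLAIN))
>>>>     def upload(data: InputStream): Response = {
>>>>       val id = UUID.randomUUID().toString
>>>>       val token = UUID.randomUUID().toString  // returned to the submission client
>>>>       val bytes = Stream.continually(data.read()).takeWhile(_ != -1).map(_.toByte).toArray
>>>>       store.put(id, (bytes, token))
>>>>       Response.ok(s"$id:$token").build()      // token must accompany the download
>>>>     }
>>>>
>>>>     @GET
>>>>     @Path("/{id}")
>>>>     @Produces(Array(MediaType.APPLICATION_OCTET_STREAM))
>>>>     def download(@PathParam("id") id: String,
>>>>                  @HeaderParam("X-Staging-Token") token: String): Response =
>>>>       Option(store.get(id)) match {
>>>>         case Some((bytes, expected)) if expected == token => Response.ok(bytes).build()
>>>>         case _ => Response.status(Response.Status.UNAUTHORIZED).build()
>>>>       }
>>>>   }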
>>>>
>>>> The KubernetesExternalShuffleService is used to support dynamic
>>>> resource allocation, with which the number of executors of a Spark
>>>> application can change at runtime based on the resource needs. It provides
>>>> an additional endpoint for drivers that allows the shuffle service to
>>>> detect driver termination and clean up the shuffle files associated with
>>>> the corresponding application. There are two ways of deploying the
>>>> KubernetesExternalShuffleService: running a shuffle service Pod on
>>>> each node in the cluster or a subset of the nodes using a DaemonSet
>>>> <https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/>,
>>>> or running a shuffle service container in each of the executor Pods. In the
>>>> first option, each shuffle service container mounts a hostPath
>>>> <https://kubernetes.io/docs/concepts/storage/volumes/#hostpath>
>>>> volume. The same hostPath volume is also mounted by each of the executor
>>>> containers, which must also have the environment variable
>>>> SPARK_LOCAL_DIRS point to the hostPath. In the second option, a
>>>> shuffle service container is co-located with an executor container in each
>>>> of the executor Pods. The two containers share an emptyDir
>>>> <https://kubernetes.io/docs/concepts/storage/volumes/#emptydir> volume
>>>> where the shuffle data gets written to. There may be multiple instances of
>>>> the shuffle service deployed in a cluster that may be used for different
>>>> versions of Spark, or for different priority levels with different resource
>>>> quotas.
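>>>>
>>>> The sketch below illustrates the idea behind the extra driver-facing
>>>> endpoint: the shuffle service tracks live drivers and removes an
>>>> application's shuffle files once its driver is gone. The names here
>>>> (registerDriver, heartbeat, the timeout) are illustrative assumptions;
>>>> the fork's actual RPC messages and termination-detection mechanism may
>>>> differ.
>>>>
>>>>   import java.util.concurrent.ConcurrentHashMap
>>>>   import scala.collection.JavaConverters._
>>>>
>>>>   class ShuffleCleanupTracker(driverTimeoutMs: Long) {
>>>>     private case class DriverState(shuffleDir: String, lastSeenMs: Long)
>>>>     private val drivers = new ConcurrentHashMap[String, DriverState]()
>>>>
>>>>     // Called when a driver registers its application with the shuffle service.
>>>>     def registerDriver(appId: String, shuffleDir: String): Unit =
>>>>       drivers.put(appId, DriverState(shuffleDir, System.currentTimeMillis()))
>>>>
>>>>     // Called periodically by each driver while it is alive.
>>>>     def heartbeat(appId: String): Unit =
>>>>       Option(drivers.get(appId)).foreach { s =>
>>>>         drivers.put(appId, s.copy(lastSeenMs = System.currentTimeMillis()))
>>>>       }
>>>>
>>>>     // Run on a timer inside the shuffle service: applications whose drivers
>>>>     // have stopped heartbeating are treated as terminated and cleaned up.
>>>>     def cleanUpTerminatedApplications(deleteDir: String => Unit): Unit = {
>>>>       val now = System.currentTimeMillis()
>>>>       drivers.asScala.foreach { case (appId, state) =>
>>>>         if (now - state.lastSeenMs > driverTimeoutMs) {
>>>>           deleteDir(state.shuffleDir)
>>>>           drivers.remove(appId)
>>>>         }
>>>>       }
>>>>     }
>>>>   }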
>>>>
>>>> New Kubernetes-specific configuration options are also introduced to
>>>> facilitate specification and customization of driver and executor Pods and
>>>> related Kubernetes resources. For example, driver and executor Pods can be
>>>> created in a particular Kubernetes namespace and on a particular set of
>>>> nodes in the cluster. Users are allowed to apply labels and annotations to
>>>> the driver and executor Pods.
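>>>>
>>>> As an illustration of how an application might opt into these settings, a
>>>> spark-shell-style snippet follows. The spark.kubernetes.* keys shown
>>>> mirror the fork's user documentation at the time of writing and should be
>>>> treated as assumptions; that documentation remains the authoritative list,
>>>> and the master URL host is a placeholder.
>>>>
>>>>   import org.apache.spark.SparkConf
>>>>
>>>>   val conf = new SparkConf()
>>>>     .setMaster("k8s://https://kubernetes.example.com:6443")     // Kubernetes API server
>>>>     .set("spark.kubernetes.namespace", "spark-jobs")            // namespace for driver/executor Pods
>>>>     .set("spark.kubernetes.driver.docker.image", "myrepo/spark-driver:v2.2.0-kubernetes")
>>>>     .set("spark.kubernetes.executor.docker.image", "myrepo/spark-executor:v2.2.0-kubernetes")
>>>>     .set("spark.kubernetes.driver.label.team", "data-platform") // label applied to the driver Pod
>>>>     .set("spark.executor.cores", "2")                           // intended to be enforced as Pod limits
>>>>     .set("spark.executor.memory", "4g")                         // (per the Goals above)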
>>>>
>>>> Additionally, secure HDFS support is being actively worked on following
>>>> the design here
>>>> <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit>.
>>>> Both short-running jobs and long-running jobs that need periodic delegation
>>>> token refresh are supported, leveraging built-in Kubernetes constructs like
>>>> Secrets. Please refer to the design doc for details.
>>>> Rejected Designs
>>>>
>>>> Resource Staging by the Driver
>>>>
>>>> A first implementation effectively included the ResourceStagingServer
>>>> in the driver container itself. The driver container ran a custom command
>>>> that opened an HTTP endpoint and waited for the submission client to send
>>>> resources to it. The server would then run the driver code after it had
>>>> received the resources from the submission client machine. The problem with
>>>> this approach is that the submission client needs to deploy the driver in
>>>> such a way that the driver itself would be reachable from outside of the
>>>> cluster, but it is difficult for an automated framework which is not aware
>>>> of the cluster's configuration to expose an arbitrary pod in a generic way.
>>>> The Service-based design chosen allows a cluster administrator to expose
>>>> the ResourceStagingServer in a manner that makes sense for their
>>>> cluster, such as with an Ingress or with a NodePort service.
>>>> Kubernetes External Shuffle Service
>>>>
>>>> Several alternatives were considered for the design of the shuffle
>>>> service. The first design postulated the use of long-lived executor pods
>>>> and sidecar containers in them running the shuffle service. The advantage
>>>> of this model was that it would let us use emptyDir for sharing as opposed
>>>> to using node local storage, which guarantees better lifecycle management
>>>> of storage by Kubernetes. The apparent disadvantage was that it would be a
>>>> departure from the traditional Spark methodology of keeping executors for
>>>> only as long as required in dynamic allocation mode. It would additionally
>>>> use up more resources than strictly necessary during the course of
>>>> long-running jobs, partially losing the advantage of dynamic scaling.
>>>>
>>>> Another alternative considered was to use a separate shuffle service
>>>> manager as a nameserver. This design has a few drawbacks. First, this means
>>>> another component that needs authentication/authorization management and
>>>> maintenance. Second, this separate component needs to be kept in sync with
>>>> the Kubernetes cluster. Last but not least, most of the functionality of this
>>>> separate component can be performed by a combination of the in-cluster
>>>> shuffle service and the Kubernetes API server.
>>>> Pluggable Scheduler Backends
>>>>
>>>> Fully pluggable scheduler backends were considered as a more
>>>> generalized solution, and remain interesting as a possible avenue for
>>>> future-proofing against new scheduling targets.  For the purposes of this
>>>> project, adding a new specialized scheduler backend for Kubernetes was
>>>> chosen as the approach due to its very low impact on the core Spark code;
>>>> making the scheduler fully pluggable would be a high-impact, high-risk
>>>> modification to Spark’s core libraries. The pluggable scheduler backends
>>>> effort is being tracked in SPARK-19700
>>>> <https://issues.apache.org/jira/browse/SPARK-19700>.
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Anirudh Ramanathan
>>
>
>
