+1 (non-binding)

Regards,
Vaquar Khan
On Mon, Aug 28, 2017 at 5:09 PM, Erik Erlandson <eerla...@redhat.com> wrote:

> In addition to the engineering & software aspects of the native Kubernetes community project, we have also worked at building out the community, with the goal of providing the foundation for sustaining engineering on the Kubernetes scheduler back-end. That said, I agree 100% with your point that adding committers with kube-specific experience is a good strategy for increasing review bandwidth to help service PRs from this community.
>
> On Mon, Aug 28, 2017 at 2:16 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>>> In my opinion, the fact that there are nearly no changes to spark-core, and most of our changes are additive should go to prove that this adds little complexity to the workflow of the committers.
>>
>> Actually (and somewhat perversely), the otherwise praiseworthy isolation of the Kubernetes code does mean that it adds complexity to the workflow of the existing Spark committers. I'll reiterate Imran's concerns: the existing Spark committers familiar with Spark's scheduler code have adequate knowledge of the Standalone and YARN implementations, and still not sufficient coverage of Mesos. Adding k8s code to Spark would mean that that code would start seeing the issues the Mesos code in Spark currently sees: reviews and commits tend to languish because we don't currently have active committers with sufficient knowledge and cycles to deal with the Mesos PRs. Some of this is because the PMC needs to get back to addressing the issue of adding new Spark committers who do have the needed Mesos skills, but that isn't as simple as we'd like, because ideally a Spark committer has demonstrated skills across a significant portion of the Spark code, not just one tightly focused area (such as Mesos or k8s integration). In short, adding Kubernetes support directly into Spark isn't likely (at least in the short term) to be entirely positive for the spark-on-k8s project, since merging of PRs to spark-on-k8s is very likely to be quite slow at least until such time as we have k8s-focused Spark committers. If this project does end up getting pulled into the Spark codebase, then the PMC will need to start looking at bringing in one or more new committers who meet our requirements for such a role and responsibility, and who also have k8s skills. The success and pace of development of spark-on-k8s will depend in large measure on the PMC's ability to find such new committers.
>>
>> All that said, I'm +1 if those currently responsible for the spark-on-k8s project still want to bring the code into Spark.
>>
>> On Mon, Aug 21, 2017 at 11:48 AM, Anirudh Ramanathan <ramanath...@google.com.invalid> wrote:
>>
>>> Thank you for your comments, Imran.
>>>
>>> Regarding integration tests: what you inferred from the documentation is correct. Integration tests do not require any prior setup or a Kubernetes cluster to run. Minikube is a single binary that brings up a one-node cluster and exposes the full Kubernetes API. It is actively maintained and kept up to date with the rest of the project. These local integration tests on Jenkins (like the ones for spark-on-yarn) should allow the committers to merge changes with a high degree of confidence. I will update the proposal to include more information about the extent and kinds of testing we do.
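>>>
>>> To give a flavor of what these tests look like, here is a minimal sketch of a ScalaTest suite that talks to a local Minikube cluster through the fabric8 Kubernetes client (the client library the fork uses). This is illustrative only, not our exact harness; the suite name and master URL are made up for the example:
>>>
>>>   import io.fabric8.kubernetes.client.{ConfigBuilder, DefaultKubernetesClient}
>>>   import org.scalatest.{BeforeAndAfterAll, FunSuite}
>>>   import scala.collection.JavaConverters._
>>>
>>>   class MinikubeSmokeSuite extends FunSuite with BeforeAndAfterAll {
>>>     // Assumes `minikube start` has already run; the master URL is
>>>     // whatever `minikube ip` reports on the local machine.
>>>     private val client = new DefaultKubernetesClient(
>>>       new ConfigBuilder().withMasterUrl("https://192.168.99.100:8443").build())
>>>
>>>     test("cluster has at least one Ready node") {
>>>       val nodes = client.nodes().list().getItems.asScala
>>>       assert(nodes.nonEmpty)
>>>       assert(nodes.exists(_.getStatus.getConditions.asScala
>>>         .exists(c => c.getType == "Ready" && c.getStatus == "True")))
>>>     }
>>>
>>>     override def afterAll(): Unit = client.close()
>>>   }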
>>>
>>> As for (b), the people on this thread and the set of contributors on our fork are a fairly wide community of contributors and committers who would be involved in the maintenance long-term. That was one of the reasons behind developing separately as a fork. In my opinion, the fact that there are nearly no changes to spark-core, and most of our changes are additive should go to prove that this adds little complexity to the workflow of the committers.
>>>
>>> Separating out the cluster managers (into an as yet undecided new home) appears far more disruptive and a high-risk change for the short term. However, when there is enough community support behind that effort, tracked in SPARK-19700 <https://issues.apache.org/jira/browse/SPARK-19700>, and if that is realized in the future, it wouldn't be difficult to switch Kubernetes, YARN, and Mesos over to the pluggable API. Currently, in my opinion, with the integration tests, active users, and a community of maintainers, Spark-on-Kubernetes would add minimal overhead and benefit a large (and growing) class of users.
>>>
>>> Lastly, the RSS is indeed separate and a value-add that we would love to share with other cluster managers as well.
>>>
>>> On Mon, Aug 21, 2017 at 10:17 AM, Imran Rashid <iras...@cloudera.com> wrote:
>>>
>>>> Overall this looks like a good proposal. I do have some concerns which I'd like to discuss -- please understand I'm taking a "devil's advocate" stance here for discussion, not that I'm giving a -1.
>>>>
>>>> My primary concern is about testing and maintenance. My concerns might be addressed if the doc included a section on testing that might just be this: https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/README.md#running-the-kubernetes-integration-tests
>>>>
>>>> but without the concerning warning "Note that the integration test framework is currently being heavily revised and is subject to change". I'd like the proposal to clearly indicate that some baseline testing can be done by devs and in spark's regular jenkins builds without special access to kubernetes clusters.
>>>>
>>>> It's worth noting that there *are* advantages to keeping it outside Spark:
>>>> * when making changes to spark's scheduler, we do *not* have to worry about how those changes impact kubernetes. This simplifies things for those making changes to spark
>>>> * making changes to the kubernetes integration is not blocked by getting enough attention from spark's committers
>>>>
>>>> or in other words, each community of experts can maintain its focus. I have these concerns based on past experience with the mesos integration -- mesos contributors are blocked on committers reviewing their changes, and then committers have no idea how to test that the changes are correct, and find it hard to even learn the ins and outs of that code without access to a mesos cluster.
>>>>
>>>> The same could be said for the yarn integration, but I think it's helped that (a) spark-on-yarn *does* have local tests for testing basic integration and (b) there is a sufficient community of contributors and committers for spark-on-yarn. I realize (b) is a chicken-and-egg problem, but I'd like to be sure that at least (a) is addressed.
>>>> (And maybe even spark-on-yarn shouldn't be inside spark itself, as mridul said, but it's not clear what the other home should be.)
>>>>
>>>> At some point, this is just a judgement call about the value it brings to the spark community vs. the added complexity. I'm willing to believe that kubernetes will bring enough value to make this worthwhile; I'm just voicing my concerns.
>>>>
>>>> Secondary concern: the RSS doesn't seem necessary for kubernetes support, or specific to it. If it's nice to have, and you want to add it to kubernetes first before other cluster managers, fine, but it seems separate from this proposal.
>>>>
>>>> On Tue, Aug 15, 2017 at 10:32 AM, Anirudh Ramanathan <fox...@google.com.invalid> wrote:
>>>>
>>>>> The Spark on Kubernetes effort has been developed separately in a fork, and is linked back from the Apache Spark project as an experimental backend <http://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types>. We're ~6 months in and have had 5 releases <https://github.com/apache-spark-on-k8s/spark/releases>.
>>>>>
>>>>> - 2 Spark versions maintained (2.1 and 2.2)
>>>>> - Extensive integration testing and refactoring efforts to maintain code quality
>>>>> - Developer <https://github.com/apache-spark-on-k8s/spark#getting-started> and user-facing <https://apache-spark-on-k8s.github.io/userdocs/> documentation
>>>>> - 10+ consistent code contributors from different organizations <https://apache-spark-on-k8s.github.io/userdocs/contribute.html#project-contributions> involved in actively maintaining and using the project, with several more members involved in testing and providing feedback
>>>>> - The community has delivered several talks on Spark-on-Kubernetes, generating lots of feedback from users
>>>>> - In addition to these, we've seen efforts spawn off, such as:
>>>>>   - HDFS on Kubernetes <https://github.com/apache-spark-on-k8s/kubernetes-HDFS> with Locality and Performance Experiments
>>>>>   - Kerberized access <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit> to HDFS from Spark running on Kubernetes
>>>>>
>>>>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>>>>
>>>>> - +1: Yeah, let's go forward and implement the SPIP.
>>>>> - +0: Don't really care.
>>>>> - -1: I don't think this is a good idea because of the following technical reasons.
>>>>>
>>>>> If there is any further clarification desired, on the design or the implementation, please feel free to ask questions or provide feedback.
>>>>>
>>>>> SPIP: Kubernetes as a Native Cluster Manager
>>>>>
>>>>> Full Design Doc: link <https://issues.apache.org/jira/secure/attachment/12881586/SPARK-18278%20Spark%20on%20Kubernetes%20Design%20Proposal%20Revision%202%20%281%29.pdf>
>>>>>
>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>>>>
>>>>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>>>>
>>>>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt Cheah, Ilan Filonenko, Sean Suchter, Kimoon Kim
>>>>>
>>>>> Background and Motivation
>>>>>
>>>>> Containerization and cluster management technologies are constantly evolving in the cluster computing world.
>>>>> Apache Spark currently implements support for Apache Hadoop YARN and Apache Mesos, in addition to providing its own standalone cluster manager. In 2014, Google announced development of Kubernetes <https://kubernetes.io/>, which has its own unique feature set and differentiates itself from YARN and Mesos. Since its debut, it has seen contributions from over 1300 contributors with over 50000 commits. Kubernetes has cemented itself as a core player in the cluster computing world, and cloud-computing offerings such as Google Container Engine, Google Compute Engine, Amazon Web Services, and Microsoft Azure support running Kubernetes clusters.
>>>>>
>>>>> This document outlines a proposal for integrating Apache Spark with Kubernetes in a first-class way, adding Kubernetes to the list of cluster managers that Spark can be used with. Doing so would allow users to share their computing resources and containerization framework between their existing applications on Kubernetes and their computational Spark applications. Although there is existing support for running a Spark standalone cluster on Kubernetes <https://github.com/kubernetes/examples/blob/master/staging/spark/README.md>, there are still major advantages and significant interest in having native execution support. For example, this integration provides better support for multi-tenancy and dynamic resource allocation. It also allows users to run applications of different Spark versions of their choice in the same cluster.
>>>>>
>>>>> The feature is being developed in a separate fork <https://github.com/apache-spark-on-k8s/spark> in order to minimize risk to the main project during development. Since the start of development in November of 2016, it has received over 100 commits from over 20 contributors and supports two releases, based on Spark 2.1 and 2.2 respectively. Documentation is also being actively worked on, both in the main project repository and in the repository https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use cases, we have seen cluster setups that use 1000+ cores. We are also seeing growing interest in this project from more and more organizations.
>>>>>
>>>>> While it is easy to bootstrap the project in a forked repository, it is hard to maintain it in the long run because of the tricky process of rebasing onto upstream and the lack of awareness in the larger Spark community. It would be beneficial to both the Spark and Kubernetes communities to see this feature merged upstream. On one hand, it gives Spark users the option of running their Spark workloads along with other workloads that may already be running on Kubernetes, enabling better resource sharing and isolation, and better cluster administration. On the other hand, it gives Kubernetes a leap forward in the area of large-scale data processing by being an officially supported cluster manager for Spark. The risk of merging into upstream is low because most of the changes are purely incremental, i.e., new Kubernetes-aware implementations of existing interfaces/classes in Spark core are introduced. The development is also concentrated in a single place at resource-managers/kubernetes <https://github.com/apache-spark-on-k8s/spark/tree/branch-2.2-kubernetes/resource-managers/kubernetes>. The risk is further reduced by a comprehensive integration test framework, and an active and responsive community of future maintainers.
>>>>>
>>>>> Target Personas
>>>>>
>>>>> Devops, data scientists, data engineers, application developers, and anyone who can benefit from having Kubernetes <https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/> as a native cluster manager for Spark.
>>>>>
>>>>> Goals
>>>>>
>>>>> - Make Kubernetes a first-class cluster manager for Spark, alongside Spark Standalone, YARN, and Mesos.
>>>>> - Support both client and cluster deployment modes.
>>>>> - Support dynamic resource allocation <http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation>.
>>>>> - Support Spark Java/Scala, PySpark, and Spark R applications.
>>>>> - Support secure HDFS access.
>>>>> - Allow running applications of different Spark versions in the same cluster through the ability to specify the driver and executor Docker images on a per-application basis.
>>>>> - Support specification and enforcement of limits on both CPU cores and memory.
>>>>>
>>>>> Non-Goals
>>>>>
>>>>> - Support cluster resource scheduling and sharing beyond the capabilities offered natively by the Kubernetes per-namespace resource quota model.
>>>>>
>>>>> Proposed API Changes
>>>>>
>>>>> Most API changes are purely incremental, i.e., new Kubernetes-aware implementations of existing interfaces/classes in Spark core are introduced. Detailed changes are as follows; a skeletal sketch of the scheduler backend follows the list.
>>>>>
>>>>> - A new cluster manager option, KUBERNETES, is introduced, and some changes are made to SparkSubmit to make it aware of this option.
>>>>> - A new implementation of CoarseGrainedSchedulerBackend, namely KubernetesClusterSchedulerBackend, is responsible for managing the creation and deletion of executor Pods through the Kubernetes API.
>>>>> - A new implementation of TaskSchedulerImpl, namely KubernetesTaskSchedulerImpl, and a new implementation of TaskSetManager, namely KubernetesTaskSetManager, are introduced for Kubernetes-aware task scheduling.
>>>>> - When dynamic resource allocation is enabled, a new implementation of ExternalShuffleService, namely KubernetesExternalShuffleService, is introduced.
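>>>>>
>>>>> As a rough illustration of the shape of these changes (a simplified sketch, not the fork's actual code; method bodies are elided and the pod-naming scheme is made up), the scheduler backend plugs into the existing extension point roughly like this, using the Spark 2.2 method signatures:
>>>>>
>>>>>   package org.apache.spark.scheduler.cluster.kubernetes
>>>>>
>>>>>   import scala.concurrent.Future
>>>>>   import io.fabric8.kubernetes.client.KubernetesClient
>>>>>   import org.apache.spark.scheduler.TaskSchedulerImpl
>>>>>   import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend
>>>>>
>>>>>   private[spark] class KubernetesClusterSchedulerBackend(
>>>>>       scheduler: TaskSchedulerImpl,
>>>>>       kubernetesClient: KubernetesClient)
>>>>>     extends CoarseGrainedSchedulerBackend(scheduler, scheduler.sc.env.rpcEnv) {
>>>>>
>>>>>     // Called by Spark core when the desired executor count changes; the new
>>>>>     // target is handed to the kubernetes-pod-allocator thread (see below).
>>>>>     override def doRequestTotalExecutors(requestedTotal: Int): Future[Boolean] = {
>>>>>       // enqueue the new target for the allocator's feedback loop
>>>>>       Future.successful(true)
>>>>>     }
>>>>>
>>>>>     // Called by Spark core to remove specific executors: delete their Pods.
>>>>>     override def doKillExecutors(executorIds: Seq[String]): Future[Boolean] = {
>>>>>       executorIds.foreach { id =>
>>>>>         kubernetesClient.pods().withName(s"executor-$id").delete()
>>>>>       }
>>>>>       Future.successful(true)
>>>>>     }
>>>>>   }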
>>>>>
>>>>> Design Sketch
>>>>>
>>>>> Below we briefly describe the design. For more details on the design and architecture, please refer to the architecture documentation <https://github.com/apache-spark-on-k8s/spark/tree/branch-2.2-kubernetes/resource-managers/kubernetes/architecture-docs>. The main idea of this design is to run the Spark driver and executors inside Kubernetes Pods <https://kubernetes.io/docs/concepts/workloads/pods/pod/>. Pods are a co-located and co-scheduled group of one or more containers running in a shared context. The driver is responsible for creating and destroying executor Pods through the Kubernetes API, while Kubernetes is fully responsible for scheduling the Pods to run on available nodes in the cluster. In cluster mode, the driver also runs in a Pod in the cluster, created through the Kubernetes API by a Kubernetes-aware submission client called by the spark-submit script. Because the driver runs in a Pod, it is reachable by the executors in the cluster using its Pod IP. In client mode, the driver runs outside the cluster and calls the Kubernetes API to create and destroy executor Pods. The driver must be routable from within the cluster for the executors to communicate with it.
>>>>>
>>>>> The main component running in the driver is the KubernetesClusterSchedulerBackend, an implementation of CoarseGrainedSchedulerBackend, which manages allocating and destroying executors via the Kubernetes API, as instructed by Spark core via calls to the methods doRequestTotalExecutors and doKillExecutors, respectively. Within the KubernetesClusterSchedulerBackend, a separate kubernetes-pod-allocator thread handles the creation of new executor Pods with appropriate throttling and monitoring. Throttling is achieved using a feedback loop that makes decisions about submitting new requests for executors based on whether previous executor Pod creation requests have completed. This indirection is necessary because the Kubernetes API server accepts requests for new Pods optimistically, with the anticipation of being able to eventually schedule them to run. However, it is undesirable to have a very large number of Pods that cannot be scheduled and stay pending within the cluster. The throttling mechanism gives us control over how fast an application scales up (which can be configured), and helps prevent Spark applications from DoS-ing the Kubernetes API server with too many Pod creation requests. The executor Pods simply run the CoarseGrainedExecutorBackend class from a pre-built Docker image that contains a Spark distribution.
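>>>>>
>>>>> The feedback loop works roughly as follows (a deliberately simplified sketch, not the fork's actual allocator; the label name, batch size, and polling interval are invented for the example):
>>>>>
>>>>>   import scala.collection.JavaConverters._
>>>>>   import io.fabric8.kubernetes.client.KubernetesClient
>>>>>
>>>>>   // Only request another batch of executor Pods once all previously
>>>>>   // requested Pods have left the Pending phase.
>>>>>   class PodAllocator(client: KubernetesClient, batchSize: Int, intervalMs: Long) {
>>>>>     @volatile var targetExecutors = 0
>>>>>
>>>>>     def run(appId: String): Unit = while (true) {
>>>>>       val pods = client.pods()
>>>>>         .withLabel("spark-app-id", appId).list().getItems.asScala
>>>>>       val pending = pods.count(_.getStatus.getPhase == "Pending")
>>>>>       val running = pods.size - pending
>>>>>       if (pending == 0 && running < targetExecutors) {
>>>>>         val toCreate = math.min(batchSize, targetExecutors - running)
>>>>>         (1 to toCreate).foreach(_ => createExecutorPod(appId))
>>>>>       }
>>>>>       Thread.sleep(intervalMs) // bounds how fast the application scales up
>>>>>     }
>>>>>
>>>>>     private def createExecutorPod(appId: String): Unit = { /* build & POST a Pod */ }
>>>>>   }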
>>>>>
>>>>> There are two auxiliary and optional components: the ResourceStagingServer and the KubernetesExternalShuffleService, which serve specific purposes described below. The ResourceStagingServer serves as a file store (in the absence of a persistent storage layer in Kubernetes) for application dependencies uploaded from the submission client machine, which then get downloaded from the server by the init-containers in the driver and executor Pods. It is a Jetty server with JAX-RS and has two endpoints, for uploading and downloading files, respectively. Security tokens are returned in the responses for file uploading and must be carried in the requests for downloading the files. The ResourceStagingServer is deployed as a Kubernetes Service <https://kubernetes.io/docs/concepts/services-networking/service/> backed by a Deployment <https://kubernetes.io/docs/concepts/workloads/controllers/deployment/> in the cluster, and multiple instances may be deployed in the same cluster. Spark applications specify which ResourceStagingServer instance to use through a configuration property.
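>>>>>
>>>>> Schematically, the two endpoints look something like this (a minimal JAX-RS sketch; the paths, media types, token transport, and helper names are illustrative, not the fork's exact API):
>>>>>
>>>>>   import java.io.InputStream
>>>>>   import javax.ws.rs.{Consumes, GET, HeaderParam, PUT, Path, PathParam}
>>>>>   import javax.ws.rs.core.{MediaType, Response}
>>>>>
>>>>>   @Path("/resources")
>>>>>   class ResourceStagingResource {
>>>>>
>>>>>     // Upload: store the bytes and hand back a security token that the
>>>>>     // init-containers must present when they download the dependency.
>>>>>     @PUT
>>>>>     @Consumes(Array(MediaType.APPLICATION_OCTET_STREAM))
>>>>>     def uploadResource(data: InputStream): Response = {
>>>>>       val resourceId = storeOnLocalDisk(data)
>>>>>       val token = issueTokenFor(resourceId)
>>>>>       Response.ok(s"""{"resourceId":"$resourceId","token":"$token"}""").build()
>>>>>     }
>>>>>
>>>>>     // Download: reject requests that do not carry a valid token.
>>>>>     @GET
>>>>>     @Path("/{resourceId}")
>>>>>     def downloadResource(
>>>>>         @PathParam("resourceId") resourceId: String,
>>>>>         @HeaderParam("Authorization") token: String): Response = {
>>>>>       require(tokenIsValid(resourceId, token))
>>>>>       Response.ok(openStoredResource(resourceId)).build()
>>>>>     }
>>>>>
>>>>>     private def storeOnLocalDisk(data: InputStream): String = ???
>>>>>     private def issueTokenFor(id: String): String = ???
>>>>>     private def tokenIsValid(id: String, token: String): Boolean = ???
>>>>>     private def openStoredResource(id: String): InputStream = ???
>>>>>   }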
>>>>>
>>>>> The KubernetesExternalShuffleService is used to support dynamic resource allocation, with which the number of executors of a Spark application can change at runtime based on the resource needs. It provides an additional endpoint for drivers that allows the shuffle service to detect driver termination and clean up the shuffle files associated with the corresponding application. There are two ways of deploying the KubernetesExternalShuffleService: running a shuffle service Pod on each node in the cluster, or on a subset of the nodes, using a DaemonSet <https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/>; or running a shuffle service container in each of the executor Pods. In the first option, each shuffle service container mounts a hostPath <https://kubernetes.io/docs/concepts/storage/volumes/#hostpath> volume. The same hostPath volume is also mounted by each of the executor containers, which must also have the environment variable SPARK_LOCAL_DIRS point to the hostPath. In the second option, a shuffle service container is co-located with an executor container in each of the executor Pods. The two containers share an emptyDir <https://kubernetes.io/docs/concepts/storage/volumes/#emptydir> volume where the shuffle data gets written. There may be multiple instances of the shuffle service deployed in a cluster, which may be used for different versions of Spark, or for different priority levels with different resource quotas.
>>>>>
>>>>> New Kubernetes-specific configuration options are also introduced to facilitate specification and customization of driver and executor Pods and related Kubernetes resources. For example, driver and executor Pods can be created in a particular Kubernetes namespace and on a particular set of the nodes in the cluster. Users are allowed to apply labels and annotations to the driver and executor Pods.
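>>>>>
>>>>> For instance, a submission might carry properties along these lines (a sketch; the property names follow the patterns in the fork's documentation, but the exact values, namespace, and API server address here are made up):
>>>>>
>>>>>   import org.apache.spark.SparkConf
>>>>>
>>>>>   val conf = new SparkConf()
>>>>>     // k8s:// master URLs point spark-submit at a Kubernetes API server
>>>>>     .setMaster("k8s://https://kubernetes.example.com:6443")
>>>>>     .set("spark.kubernetes.namespace", "spark-jobs")
>>>>>     // per-application labels/annotations applied to the created Pods
>>>>>     .set("spark.kubernetes.driver.label.team", "data-eng")
>>>>>     .set("spark.kubernetes.executor.annotation.owner", "spark")
>>>>>     .set("spark.executor.instances", "8")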
>>>>>
>>>>> Additionally, secure HDFS support is being actively worked on, following the design here <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit>. Both short-running jobs and long-running jobs that need periodic delegation token refresh are supported, leveraging built-in Kubernetes constructs like Secrets. Please refer to the design doc for details.
>>>>>
>>>>> Rejected Designs
>>>>>
>>>>> Resource Staging by the Driver
>>>>>
>>>>> A first implementation effectively included the ResourceStagingServer in the driver container itself. The driver container ran a custom command that opened an HTTP endpoint and waited for the submission client to send resources to it. The server would then run the driver code after it had received the resources from the submission client machine. The problem with this approach is that the submission client needs to deploy the driver in such a way that the driver itself is reachable from outside of the cluster, but it is difficult for an automated framework that is not aware of the cluster's configuration to expose an arbitrary pod in a generic way. The Service-based design that was chosen instead allows a cluster administrator to expose the ResourceStagingServer in a manner that makes sense for their cluster, such as with an Ingress or with a NodePort service.
>>>>>
>>>>> Kubernetes External Shuffle Service
>>>>>
>>>>> Several alternatives were considered for the design of the shuffle service. The first design postulated the use of long-lived executor Pods with sidecar containers in them running the shuffle service. The advantage of this model was that it would let us use emptyDir for sharing as opposed to using node-local storage, which guarantees better lifecycle management of storage by Kubernetes. The apparent disadvantage was that it would be a departure from the traditional Spark methodology of keeping executors for only as long as required in dynamic allocation mode. It would additionally use up more resources than strictly necessary during the course of long-running jobs, partially losing the advantage of dynamic scaling.
>>>>>
>>>>> Another alternative considered was to use a separate shuffle service manager as a nameserver. This design has a few drawbacks. First, it means another component that needs authentication/authorization management and maintenance. Second, this separate component needs to be kept in sync with the Kubernetes cluster. Last but not least, most of the functionality of this separate component can be performed by a combination of the in-cluster shuffle service and the Kubernetes API server.
>>>>>
>>>>> Pluggable Scheduler Backends
>>>>>
>>>>> Fully pluggable scheduler backends were considered as a more generalized solution, and remain interesting as a possible avenue for future-proofing against new scheduling targets. For the purposes of this project, adding a new specialized scheduler backend for Kubernetes was chosen as the approach due to its very low impact on the core Spark code; making the scheduler fully pluggable would be a high-impact, high-risk modification to Spark's core libraries. The pluggable scheduler backends effort is being tracked in SPARK-19700 <https://issues.apache.org/jira/browse/SPARK-19700>.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>> --
>>> Anirudh Ramanathan

--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago