[GitHub] spark pull request #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documen...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19946 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documen...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/19946#discussion_r158357822 --- Diff: docs/running-on-yarn.md --- @@ -18,7 +18,8 @@ Spark application's configuration (driver, executors, and the AM when running in There are two deploy modes that can be used to launch Spark applications on YARN. In `cluster` mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In `client` mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. -Unlike [Spark standalone](spark-standalone.html) and [Mesos](running-on-mesos.html) modes, in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn`. +Unlike other cluster managers supported by Spark +in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn`. --- End diff -- Done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documen...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19946#discussion_r158205893 --- Diff: docs/building-spark.md --- @@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by the to be runnable, use `./dev/make-distribution.sh` in the project root directory. It can be configured with Maven profile settings and so on like the direct Maven build. Example: -./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn +./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes --- End diff -- Yea I don't think you need to block this pr with this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documen...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19946#discussion_r158202645 --- Diff: docs/running-on-yarn.md --- @@ -18,7 +18,8 @@ Spark application's configuration (driver, executors, and the AM when running in There are two deploy modes that can be used to launch Spark applications on YARN. In `cluster` mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In `client` mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. -Unlike [Spark standalone](spark-standalone.html) and [Mesos](running-on-mesos.html) modes, in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn`. +Unlike other cluster managers supported by Spark +in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn`. --- End diff -- nit: why start a new line here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documen...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/19946#discussion_r158173626 --- Diff: docs/running-on-kubernetes.md --- @@ -0,0 +1,573 @@ +--- +layout: global +title: Running Spark on Kubernetes +--- +* This will become a table of contents (this text will be scraped). +{:toc} + +Spark can run on clusters managed by [Kubernetes](https://kubernetes.io). This feature makes use of native +Kubernetes scheduler that has been added to Spark. + +# Prerequisites + +* A runnable distribution of Spark 2.3 or above. +* A running Kubernetes cluster at version >= 1.6 with access configured to it using +[kubectl](https://kubernetes.io/docs/user-guide/prereqs/). If you do not already have a working Kubernetes cluster, +you may setup a test cluster on your local machine using +[minikube](https://kubernetes.io/docs/getting-started-guides/minikube/). + * We recommend using the latest release of minikube with the DNS addon enabled. +* You must have appropriate permissions to list, create, edit and delete +[pods](https://kubernetes.io/docs/user-guide/pods/) in your cluster. You can verify that you can list these resources +by running `kubectl auth can-i pods`. + * The service account credentials used by the driver pods must be allowed to create pods, services and configmaps. +* You must have [Kubernetes DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/) configured in your cluster. + +# How it works + + + + + +spark-submit can be directly used to submit a Spark application to a Kubernetes cluster. +The submission mechanism works as follows: + +* Spark creates a Spark driver running within a [Kubernetes pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/). +* The driver creates executors which are also running within Kubernetes pods and connects to them, and executes application code. +* When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists +logs and remains in "completed" state in the Kubernetes API until it's eventually garbage collected or manually cleaned up. + +Note that in the completed state, the driver pod does *not* use any computational or memory resources. + +The driver and executor pod scheduling is handled by Kubernetes. It will be possible to affect Kubernetes scheduling +decisions for driver and executor pods using advanced primitives like +[node selectors](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector) +and [node/pod affinities](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity) +in a future release. + +# Submitting Applications to Kubernetes + +## Docker Images + +Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built to +be run in a container runtime environment that Kubernetes supports. Docker is a container runtime environment that is +frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles provided in the runnable distribution that can be customized +and built for your usage. + +You may build these docker images from sources. +There is a script, `sbin/build-push-docker-images.sh` that you can use to build and push +customized Spark distribution images consisting of all the above components. + +Example usage is: + +./sbin/build-push-docker-images.sh -r -t my-tag build +./sbin/build-push-docker-images.sh -r -t my-tag push + +Docker files are under the `kubernetes/dockerfiles/` directory and can be customized further before +building using the supplied script, or manually. + +## Cluster Mode + +To launch Spark Pi in cluster mode, + +{% highlight bash %} +$ bin/spark-submit \ +--master k8s://https://: \ +--deploy-mode cluster \ +--name spark-pi \ --- End diff -- Done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documen...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/19946#discussion_r158172980 --- Diff: docs/building-spark.md --- @@ -49,7 +49,7 @@ To create a Spark distribution like those distributed by the to be runnable, use `./dev/make-distribution.sh` in the project root directory. It can be configured with Maven profile settings and so on like the direct Maven build. Example: -./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn +./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes --- End diff -- I've filed https://issues.apache.org/jira/browse/SPARK-22853 to discuss this and unblock this PR. We should be able to reach consensus by release time. :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19946: [SPARK-22648] [K8S] Spark on Kubernetes - Documen...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/19946#discussion_r158170697 --- Diff: docs/running-on-kubernetes.md --- @@ -0,0 +1,573 @@ +--- +layout: global +title: Running Spark on Kubernetes +--- +* This will become a table of contents (this text will be scraped). +{:toc} + +Spark can run on clusters managed by [Kubernetes](https://kubernetes.io). This feature makes use of native +Kubernetes scheduler that has been added to Spark. + +# Prerequisites + +* A runnable distribution of Spark 2.3 or above. +* A running Kubernetes cluster at version >= 1.6 with access configured to it using +[kubectl](https://kubernetes.io/docs/user-guide/prereqs/). If you do not already have a working Kubernetes cluster, +you may setup a test cluster on your local machine using +[minikube](https://kubernetes.io/docs/getting-started-guides/minikube/). + * We recommend using the latest release of minikube with the DNS addon enabled. +* You must have appropriate permissions to list, create, edit and delete +[pods](https://kubernetes.io/docs/user-guide/pods/) in your cluster. You can verify that you can list these resources +by running `kubectl auth can-i pods`. + * The service account credentials used by the driver pods must be allowed to create pods, services and configmaps. +* You must have [Kubernetes DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/) configured in your cluster. + +# How it works + + + + + +spark-submit can be directly used to submit a Spark application to a Kubernetes cluster. +The submission mechanism works as follows: + +* Spark creates a Spark driver running within a [Kubernetes pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/). +* The driver creates executors which are also running within Kubernetes pods and connects to them, and executes application code. +* When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists +logs and remains in "completed" state in the Kubernetes API until it's eventually garbage collected or manually cleaned up. + +Note that in the completed state, the driver pod does *not* use any computational or memory resources. + +The driver and executor pod scheduling is handled by Kubernetes. It will be possible to affect Kubernetes scheduling +decisions for driver and executor pods using advanced primitives like +[node selectors](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector) +and [node/pod affinities](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity) +in a future release. + +# Submitting Applications to Kubernetes + +## Docker Images + +Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built to +be run in a container runtime environment that Kubernetes supports. Docker is a container runtime environment that is +frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles provided in the runnable distribution that can be customized +and built for your usage. + +You may build these docker images from sources. +There is a script, `sbin/build-push-docker-images.sh` that you can use to build and push +customized Spark distribution images consisting of all the above components. + +Example usage is: + +./sbin/build-push-docker-images.sh -r -t my-tag build +./sbin/build-push-docker-images.sh -r -t my-tag push + +Docker files are under the `kubernetes/dockerfiles/` directory and can be customized further before +building using the supplied script, or manually. + +## Cluster Mode + +To launch Spark Pi in cluster mode, + +{% highlight bash %} +$ bin/spark-submit \ +--master k8s://https://: \ +--deploy-mode cluster \ +--name spark-pi \ --- End diff -- Good point, will update with caveat. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org