This is an automated email from the ASF dual-hosted git repository.

mmack pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
     new 0b4302e5f95 [Documentation] Update docs to run SparkPipelineRunner on a Kubernetes cluster (closes #27984)
0b4302e5f95 is described below

commit 0b4302e5f95f2dc9b6658c13d5d1aa798cfba668
Author: Hao Xu <sduxu...@gmail.com>
AuthorDate: Fri Sep 1 05:33:12 2023 -0700

    [Documentation] Update docs to run SparkPipelineRunner on a Kubernetes cluster (closes #27984)
---
 .../site/content/en/documentation/runners/spark.md | 47 +++++++++++++++++++++-
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/website/www/site/content/en/documentation/runners/spark.md b/website/www/site/content/en/documentation/runners/spark.md
index dcc166873dc..29ef5c28102 100644
--- a/website/www/site/content/en/documentation/runners/spark.md
+++ b/website/www/site/content/en/documentation/runners/spark.md
@@ -487,5 +487,48 @@ Provided SparkContext and StreamingListeners are not supported on the Spark port
 {{< /paragraph >}}
 
 ### Kubernetes
-
-An [example](https://github.com/cometta/python-apache-beam-spark) of configuring Spark to run Apache beam job
+#### Submit Beam job without job server
+To submit a Beam job directly to a Spark on Kubernetes cluster, without spinning up an extra job server, run:
+```
+spark-submit --master MASTER_URL \
+  --conf spark.kubernetes.driver.podTemplateFile=driver_pod_template.yaml \
+  --conf spark.kubernetes.executor.podTemplateFile=executor_pod_template.yaml \
+  --class org.apache.beam.runners.spark.SparkPipelineRunner \
+  --conf spark.kubernetes.container.image=apache/spark:v3.3.2 \
+  ./wc_job.jar
+```
+As when running a Beam job on Dataproc, you can bundle the job jar as shown below. The example uses the `PROCESS` type of [SDK harness](https://beam.apache.org/documentation/runtime/sdk-harness-config/) to execute the job in separate processes.
+
+```
+python -m beam_example_wc \
+  --runner=SparkRunner \
+  --output_executable_path=./wc_job.jar \
+  --environment_type=PROCESS \
+  --environment_config="{\"command\": \"/opt/apache/beam/boot\"}" \
+  --spark_version=3
+```
+
+Below is an example Kubernetes executor pod template. The `initContainer` is required to download the Beam SDK harness that runs the Beam pipeline.
+```
+spec:
+  containers:
+    - name: spark-kubernetes-executor
+      volumeMounts:
+        - name: beam-data
+          mountPath: /opt/apache/beam/
+  initContainers:
+    - name: init-beam
+      image: apache/beam_python3.7_sdk
+      command:
+        - cp
+        - /opt/apache/beam/boot
+        - /init-container/data/boot
+      volumeMounts:
+        - name: beam-data
+          mountPath: /init-container/data
+  volumes:
+    - name: beam-data
+      emptyDir: {}
+```
+
+#### Submit Beam job with job server
+An [example](https://github.com/cometta/python-apache-beam-spark) of configuring Spark to run an Apache Beam job with a job server.
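For the job-server route, a minimal command sketch may help alongside the linked example. It assumes a standalone Spark master reachable at `spark://localhost:7077` (a placeholder you would replace with your cluster's master URL), uses the `apache/beam_spark_job_server` image and the job server's default endpoint port 8099 from the Beam Spark runner docs, and reuses the hypothetical `beam_example_wc` pipeline module from the examples above:

```
# Start the Beam Spark job server, pointing it at the Spark master
# (assumption: master URL is spark://localhost:7077; adjust for your cluster).
docker run --net=host apache/beam_spark_job_server:latest \
  --spark-master-url=spark://localhost:7077

# Submit the pipeline to the job server via the portable runner
# (8099 is the job server's default job endpoint port).
python -m beam_example_wc \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_type=LOOPBACK
```

Note that `LOOPBACK` runs the SDK harness inside the submitting process, which is convenient for local testing; for a real cluster you would typically use `DOCKER` or `PROCESS` as shown earlier.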