[jira] [Updated] (SPARK-47952) Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn

ASF GitHub Bot (Jira) Tue, 23 Apr 2024 03:36:12 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-47952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated SPARK-47952:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support retrieving the real SparkConnectService GRPC address and port 
> programmatically when running on Yarn
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-47952
>                 URL: https://issues.apache.org/jira/browse/SPARK-47952
>             Project: Spark
>          Issue Type: Story
>          Components: Connect
>    Affects Versions: 4.0.0
>            Reporter: TakawaAkirayo
>            Priority: Minor
>              Labels: pull-request-available
>
> 1. User Story:
> Our data analysts and data scientists use Jupyter notebooks provisioned on 
> Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
> in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
> significant local memory if the job is heavy, and the total resource pool of 
> k8s for notebooks is limited. To leverage the abundant resources of our 
> Hadoop cluster for scalability purposes, we aim to utilize SparkConnect. This 
> allows the driver to run on Yarn with SparkConnectService started, while a 
> SparkConnect client connects to the remote driver.
> To provide a seamless experience with one command startup for both server and 
> client, we've wrapped the following processes in one script:
> 1) Start a local coordinator server (implemented by us, not in this PR) with 
> a specified port.
> 2) Start SparkConnectServer by spark-submit via Yarn Cluster mode with 
> user-input Spark configurations and the local coordinator server's address 
> and port. Append an additional listener class to the configuration so that 
> SparkConnectService calls back to the coordinator server with its actual 
> address and port on Yarn.
> 3) Wait for the coordinator server to receive the address callback from the 
> SparkConnectService on Yarn and export the real address.
> 4) Start the client (pyspark --remote) with the remote address.
> Finally, a remote SparkConnect Server is started on Yarn with a local 
> SparkConnect client connected. Users no longer need to start the server 
> beforehand, manually look up its address on Yarn, and then connect to it.
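A minimal sketch of steps 1) to 4) from the launcher's side, assuming a coordinator that accepts a single HTTP POST with a JSON body. The coordinator in the story is the reporter's own and is not shown in this issue, so `Coordinator`, `report_address`, the JSON field names, and the HTTP transport are all hypothetical; only the `sc://host:port` form passed to `pyspark --remote` is standard Spark Connect usage.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Coordinator:
    """Hypothetical local coordinator: waits for one address callback."""

    def __init__(self, port=0):
        self.address = None
        self._ready = threading.Event()
        coordinator = self

        class Handler(BaseHTTPRequestHandler):
            def do_POST(self):
                # Assumed payload: {"host": "...", "port": 12345}
                length = int(self.headers.get("Content-Length", 0))
                payload = json.loads(self.rfile.read(length))
                coordinator.address = (payload["host"], int(payload["port"]))
                coordinator._ready.set()
                self.send_response(204)
                self.end_headers()

            def log_message(self, *args):
                pass  # keep the sketch quiet

        # port=0 lets the OS choose a free local port for the coordinator.
        self._server = HTTPServer(("127.0.0.1", port), Handler)
        self.port = self._server.server_address[1]
        threading.Thread(target=self._server.serve_forever, daemon=True).start()

    def wait_for_remote_url(self, timeout):
        """Step 3: block until the callback arrives, then build the remote URL."""
        if not self._ready.wait(timeout):
            raise TimeoutError("no address callback received")
        host, port = self.address
        return f"sc://{host}:{port}"

    def close(self):
        self._server.shutdown()
        self._server.server_close()


def report_address(coordinator_host, coordinator_port, host, port):
    """Stand-in for the server-side listener that reports the real address."""
    body = json.dumps({"host": host, "port": port}).encode("utf-8")
    conn = http.client.HTTPConnection(coordinator_host, coordinator_port, timeout=10)
    try:
        conn.request("POST", "/", body=body,
                     headers={"Content-Type": "application/json"})
        return conn.getresponse().status
    finally:
        conn.close()
```

With the URL in hand, the wrapper script would start the client with something like `pyspark --remote "$url"`. A real coordinator would have to listen on an interface reachable from the Yarn nodes rather than on 127.0.0.1.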
> 2. Problem statement of this change:
> 1) The specified port for the SparkConnectService GRPC server might be 
> occupied on the node of the Hadoop Cluster. To increase the success rate of 
> startup, the server needs to retry on port conflicts rather than fail 
> directly.
> 2) Because the final binding port is uncertain given item 1) and the 
> remote address is unpredictable on Yarn, we need to retrieve the address and 
> port programmatically and inject them automatically when starting 
> `pyspark --remote`. The SparkConnectService needs to communicate its location 
> back to the launcher side.
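The retry behaviour asked for in item 1) can be illustrated with a plain TCP socket. This is a sketch of the general technique (try the next port when the preferred one is taken), not the SparkConnectService implementation; `bind_with_retry` and its `max_retries` default are assumptions made for the example.

```python
import errno
import socket


def bind_with_retry(host, preferred_port, max_retries=16):
    """Bind to preferred_port, or to the next free port above it.

    Returns (listening_socket, actual_port). The actual port is what a
    service would have to report back to its launcher.
    """
    for offset in range(max_retries + 1):
        port = preferred_port + offset
        if port > 65535:
            break  # ran out of valid port numbers
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise  # only port conflicts are retried
            continue
        sock.listen()
        return sock, sock.getsockname()[1]
    raise OSError(errno.EADDRINUSE,
                  f"no free port found starting at {preferred_port}")
```

Because the port that finally binds may differ from the configured one, the caller has to read it back from the socket and publish it, which is the reason item 2) asks for a programmatic way to retrieve the address.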



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
