[jira] [Updated] (SPARK-47952) Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn

TakawaAkirayo (Jira) Tue, 23 Apr 2024 00:57:05 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-47952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


TakawaAkirayo updated SPARK-47952:
----------------------------------
    Description: 
1. User Story:
Our data analysts and data scientists use Jupyter notebooks provisioned on 
Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
significant local memory if the job is heavy, and the total resource pool of 
k8s for notebooks is limited. To leverage the abundant resources of our Hadoop 
cluster for scalability purposes, we aim to utilize SparkConnect. This lets 
the driver run on Yarn with SparkConnectService started, while a SparkConnect 
client connects to the remote driver.

To provide a seamless experience with one command startup for both server and 
client, we've wrapped the following processes in one script:

1) Start a local coordinator server (implemented by us, not in this PR) with a 
specified port.
2) Start SparkConnectServer with spark-submit in Yarn Cluster mode, passing the 
user-input Spark configurations and the local coordinator server's address and 
port. Append an additional listener class to the configuration so that 
SparkConnectService calls back to the coordinator server with its actual 
address and port on Yarn.
3) Wait for the coordinator server to receive the address callback from the 
SparkConnectService on Yarn and export the real address.
4) Start the client (pyspark --remote) with the remote address.
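
The coordinator itself is not part of this PR, so the following is only a 
hypothetical Python sketch of steps 1, 3 and 4: a TCP listener that waits for a 
single callback and turns it into a Spark Connect connection string. The 
function names, the one-line `host:port` payload, and the timeout are all 
assumptions made for illustration.

```python
import socket
from typing import Tuple


def open_coordinator(port: int = 0, bind_host: str = "0.0.0.0") -> socket.socket:
    """Step 1: listen for the callback; port 0 lets the OS pick a free port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((bind_host, port))
    server.listen(1)
    return server


def wait_for_address(server: socket.socket, timeout: float = 300.0) -> Tuple[str, int]:
    """Step 3: block until one callback arrives.

    Assumed payload (hypothetical wire format): a single line b"host:port\n".
    """
    server.settimeout(timeout)
    conn, _ = server.accept()
    with conn:
        conn.settimeout(timeout)
        buf = b""
        while not buf.endswith(b"\n"):
            chunk = conn.recv(1024)
            if not chunk:  # peer closed the connection
                break
            buf += chunk
    data = buf.decode("utf-8").strip()
    host, _, port = data.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"malformed callback payload: {data!r}")
    return host, int(port)


def remote_url(host: str, port: int) -> str:
    """Step 4: build the connection string passed to `pyspark --remote`."""
    return f"sc://{host}:{port}"
```

The string returned by `remote_url` is what the wrapper script would hand to 
`pyspark --remote` once the callback has arrived.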

2. Problem statement of this change:
1) The specified port for the SparkConnectService GRPC server might be occupied 
on the node of the Hadoop Cluster. To increase the success rate of startup, the 
service needs to retry on port conflicts rather than fail immediately.
2) Because the final bound port is uncertain given #1 and the remote address 
is unpredictable on Yarn, we need to retrieve the address and port 
programmatically and inject them automatically at the start of `pyspark 
--remote`. The SparkConnectService needs to communicate its location back to 
the launcher side.
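
To make the two points above concrete, here is a generic Python sketch using 
plain sockets. It is not the SparkConnectService implementation: the name 
`bind_with_retry`, the increment-by-one strategy, and the default of 16 
retries are assumptions. It shows why the port that ends up bound can differ 
from the requested one, and why the actual value has to be reported back.

```python
import errno
import socket
from typing import Tuple


def bind_with_retry(start_port: int, max_retries: int = 16,
                    host: str = "0.0.0.0") -> Tuple[socket.socket, int]:
    """Try start_port, then start_port + 1, ... until a bind succeeds.

    Returns the listening socket and the port that was actually bound.
    """
    for offset in range(max_retries + 1):
        port = start_port + offset
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise  # only port conflicts are retried
            continue
        sock.listen()
        return sock, sock.getsockname()[1]
    raise OSError(errno.EADDRINUSE,
                  f"no free port in {start_port}..{start_port + max_retries}")
```

The second element of the returned tuple is the value a listener would need 
to send back to the launcher side, since it may not equal `start_port`.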

  was:
User Story:
Our data analysts and data scientists use Jupyter notebooks provisioned on 
Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
significant local memory if the job is heavy, and the total resource pool of 
k8s for notebooks is limited. To leverage the abundant resources of our Hadoop 
cluster for scalability purposes, we aim to utilize SparkConnect. This lets 
the driver run on Yarn with SparkConnectService started, while a SparkConnect 
client connects to the remote driver.

To provide a seamless experience with one command startup for both server and 
client, we've wrapped the following processes in one script:

1. Start a local coordinator server (implemented by us, not in this PR) with a 
specified port.
2. Start SparkConnectServer with spark-submit in Yarn Cluster mode, passing the 
user-input Spark configurations and the local coordinator server's address and 
port. Append an additional listener class to the configuration so that 
SparkConnectService calls back to the coordinator server with its actual 
address and port on Yarn.
3. Wait for the coordinator server to receive the address callback from the 
SparkConnectService on Yarn and export the real address.
4. Start the client (pyspark --remote) with the remote address.

Problem statement of this change:
1. The specified port for the SparkConnectService GRPC server might be occupied 
on the node of the Hadoop Cluster. To increase the success rate of startup, the 
service needs to retry on port conflicts rather than fail immediately.
2. Because the final bound port is uncertain given #1 and the remote address 
is unpredictable on Yarn, we need to retrieve the address and port 
programmatically and inject them automatically at the start of `pyspark 
--remote`. The SparkConnectService needs to communicate its location back to 
the launcher side.


> Support retrieving the real SparkConnectService GRPC address and port 
> programmatically when running on Yarn
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-47952
>                 URL: https://issues.apache.org/jira/browse/SPARK-47952
>             Project: Spark
>          Issue Type: Story
>          Components: Connect
>    Affects Versions: 4.0.0
>            Reporter: TakawaAkirayo
>            Priority: Minor
>
> 1. User Story:
> Our data analysts and data scientists use Jupyter notebooks provisioned on 
> Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
> in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
> significant local memory if the job is heavy, and the total resource pool of 
> k8s for notebooks is limited. To leverage the abundant resources of our 
> Hadoop cluster for scalability purposes, we aim to utilize SparkConnect. This 
> lets the driver run on Yarn with SparkConnectService started, while a 
> SparkConnect client connects to the remote driver.
> To provide a seamless experience with one command startup for both server and 
> client, we've wrapped the following processes in one script:
> 1) Start a local coordinator server (implemented by us, not in this PR) with 
> a specified port.
> 2) Start SparkConnectServer with spark-submit in Yarn Cluster mode, passing 
> the user-input Spark configurations and the local coordinator server's 
> address and port. Append an additional listener class to the configuration so 
> that SparkConnectService calls back to the coordinator server with its actual 
> address and port on Yarn.
> 3) Wait for the coordinator server to receive the address callback from the 
> SparkConnectService on Yarn and export the real address.
> 4) Start the client (pyspark --remote) with the remote address.
> 2. Problem statement of this change:
> 1) The specified port for the SparkConnectService GRPC server might be 
> occupied on the node of the Hadoop Cluster. To increase the success rate of 
> startup, the service needs to retry on port conflicts rather than fail 
> immediately.
> 2) Because the final bound port is uncertain given #1 and the remote address 
> is unpredictable on Yarn, we need to retrieve the address and port 
> programmatically and inject them automatically at the start of `pyspark 
> --remote`. The SparkConnectService needs to communicate its location back to 
> the launcher side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
