[ https://issues.apache.org/jira/browse/SPARK-47952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-47952:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support retrieving the real SparkConnectService GRPC address and port
> programmatically when running on Yarn
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-47952
>                 URL: https://issues.apache.org/jira/browse/SPARK-47952
>             Project: Spark
>          Issue Type: Story
>          Components: Connect
>    Affects Versions: 4.0.0
>            Reporter: TakawaAkirayo
>            Priority: Minor
>              Labels: pull-request-available
>
> 1. User Story:
> Our data analysts and data scientists use Jupyter notebooks provisioned on
> Kubernetes (k8s) with limited CPU/memory resources to run spark-shell/pyspark
> in the terminal via Yarn Client mode. However, Yarn Client mode consumes
> significant local memory when the job is heavy, and the total k8s resource
> pool for notebooks is limited. To leverage the abundant resources of our
> Hadoop cluster for scalability, we aim to use SparkConnect. This lets the
> driver run on Yarn with SparkConnectService started, while a SparkConnect
> client connects to the remote driver.
> To provide a seamless experience with one-command startup for both server and
> client, we've wrapped the following steps in one script:
> 1) Start a local coordinator server (implemented by us, not in this PR) on
> a specified port.
> 2) Start SparkConnectServer with spark-submit in Yarn Cluster mode, passing
> the user-supplied Spark configurations and the local coordinator server's
> address and port. Append an additional listener class to the configuration
> so that SparkConnectService calls back to the coordinator server with its
> actual address and port on Yarn.
> 3) Wait for the coordinator server to receive the address callback from the
> SparkConnectService on Yarn and export the real address.
> 4) Start the client (pyspark --remote) with the remote address.
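
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the reporter's coordinator (which is not part of the PR): the line-based `host:port` message, the function names, and the example host name are all assumptions, and `report_address` only simulates the listener that would run next to SparkConnectService on Yarn.

```python
import socket
import threading


def wait_for_callback(server: socket.socket, timeout: float = 30.0) -> str:
    """Block until one client sends a 'host:port' line, then return it."""
    server.settimeout(timeout)
    conn, _ = server.accept()
    with conn:
        buf = b""
        while not buf.endswith(b"\n"):
            chunk = conn.recv(1024)
            if not chunk:  # peer closed without a trailing newline
                break
            buf += chunk
    return buf.decode("utf-8").strip()


def report_address(coordinator: tuple, address: str) -> None:
    """Hypothetical stand-in for the server-side listener callback."""
    with socket.create_connection(coordinator) as conn:
        conn.sendall((address + "\n").encode("utf-8"))


def client_command(address: str) -> list:
    """Build the client launch command for the reported address."""
    return ["pyspark", "--remote", f"sc://{address}"]


def demo() -> list:
    # Step 1: start the coordinator on an ephemeral local port.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    coordinator = server.getsockname()

    # Step 2 (simulated): the remote service reports where it actually bound.
    # The host name below is made up for the example.
    reporter = threading.Thread(
        target=report_address,
        args=(coordinator, "yarn-node-17.example.com:15009"),
    )
    reporter.start()

    # Step 3: wait for the callback and take the real address from it.
    try:
        address = wait_for_callback(server)
    finally:
        server.close()
    reporter.join()

    # Step 4: the command the wrapper script would launch.
    return client_command(address)


if __name__ == "__main__":
    print(" ".join(demo()))
```

In a real wrapper script, step 2 would be the spark-submit call and step 4 would execute the returned command instead of printing it.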
> Finally, a remote SparkConnect Server is started on Yarn with a local
> SparkConnect client connected. Users no longer need to start the server
> beforehand, look up its address on Yarn manually, and then connect to it.
>
> 2. Problem statement of this change:
> 1) The specified port for the SparkConnectService GRPC server might already
> be occupied on the Hadoop cluster node. To increase the startup success
> rate, the service needs to retry on port conflicts rather than fail
> immediately.
> 2) Because the port that is finally bound is uncertain (see 1) and the
> remote address is unpredictable on Yarn, we need to retrieve the address and
> port programmatically and inject them automatically when `pyspark --remote`
> starts. The SparkConnectService needs to communicate its location back to
> the launcher side.
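
The retry-on-conflict behaviour described in the problem statement can be illustrated with plain TCP sockets. This is only a sketch under assumptions: the real service binds a gRPC server rather than a raw socket, and the function name `bind_with_retry`, the "try the next port" policy, and the default of 16 retries are illustrative choices, not taken from the PR.

```python
import socket


def bind_with_retry(host: str, start_port: int, max_retries: int = 16) -> socket.socket:
    """Try start_port, then start_port + 1, ... until a bind succeeds."""
    last_error = None
    for offset in range(max_retries + 1):
        port = start_port + offset
        if port > 65535:  # ran past the valid TCP port range
            break
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen()
            return sock
        except OSError as exc:  # typically "address already in use"
            sock.close()
            last_error = exc
    raise OSError(
        f"no free port found starting at {start_port} "
        f"after {max_retries} retries"
    ) from last_error


if __name__ == "__main__":
    server = bind_with_retry("127.0.0.1", 15002)
    # The port actually bound may differ from the requested one, which is
    # why it has to be read back and reported to the launcher side.
    host, port = server.getsockname()
    print(f"bound to {host}:{port}")
    server.close()
```

The point of the sketch is the last step: once retries are allowed, the only reliable source for the port is the bound socket itself, so the value must be read back and sent to the launcher.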