Hi,
I have found three important classes:
org.apache.spark.sql.connect.service.SparkConnectServer : the
./sbin/start-connect-server.sh script uses SparkConnectServer as its main
class. In main, it creates a local session with
SparkSession.builder.getOrCreate() and starts the SparkConnectService.
org.apache.spark.sql.connect.SparkConnectPlugin : To enable Spark Connect,
simply make sure that the appropriate JAR is available in the CLASSPATH and the
driver plugin is configured to load this class.
org.apache.spark.sql.connect.SimpleSparkConnectService : A simple main class
method to start the spark connect server as a service for client tests.
So I believe that by configuring spark.plugins and starting the Spark cluster
on Kubernetes, clients can use sc://ip:port to connect to the remote server.
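If that reasoning holds, a submission along these lines might work. This is only an untested sketch: the API server address, container image, and the application used to keep the driver alive are hypothetical placeholders, and spark.connect.grpc.binding.port is the documented setting for the server's gRPC port (default 15002).

```shell
# Untested sketch: run an ordinary application in cluster deploy mode on
# Kubernetes, and let the driver plugin start Spark Connect inside the driver.
# <k8s-apiserver>, <spark-image-with-connect-jars>, and the application
# argument are placeholders.
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-connect-driver \
  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin \
  --conf spark.connect.grpc.binding.port=15002 \
  --conf spark.kubernetes.container.image=<spark-image-with-connect-jars> \
  <long-running application that keeps the driver alive>
```

A Kubernetes Service in front of the driver pod would then need to expose port 15002 so clients can reach it via sc://.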
Let me give it a try.
eabour
From: eab...@163.com
Date: 2023-10-19 14:28
To: Nagatomi Yasukazu; user @spark
Subject: Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes
Hi all,
Has the functionality to run the Spark Connect Server on Kubernetes been
implemented?
From: Nagatomi Yasukazu
Date: 2023-09-05 17:51
To: user
Subject: Re: Running Spark Connect Server in Cluster Mode on Kubernetes
Dear Spark Community,
I've been exploring the capabilities of the Spark Connect Server and
encountered an issue when trying to launch it in a cluster deploy mode with
Kubernetes as the master.
While initiating the `start-connect-server.sh` script with the `--conf`
parameter for `spark.master` and `spark.submit.deployMode`, I was met with an
error message:
```
Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode
is not applicable to Spark Connect server.
```
This error message can be traced back to Spark's source code here:
https://github.com/apache/spark/blob/6c885a7cf57df328b03308cff2eed814bda156e4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L307
Given my observations, I'm curious about the Spark Connect Server roadmap:
Is there a plan or current conversation to enable Kubernetes as a master in
Spark Connect Server's cluster deploy mode?
I have tried to gather information from existing JIRA tickets, but have not
been able to get a definitive answer:
https://issues.apache.org/jira/browse/SPARK-42730
https://issues.apache.org/jira/browse/SPARK-39375
https://issues.apache.org/jira/browse/SPARK-44117
Any thoughts, updates, or references to similar conversations or initiatives
would be greatly appreciated.
Thank you for your time and expertise!
Best regards,
Yasukazu
On Tue, Sep 5, 2023 at 12:09, Nagatomi Yasukazu wrote:
Hello Mich,
Thank you for your questions. Here are my responses:
> 1. What investigation have you done to show that it is running in local mode?
I have verified through the History Server's Environment tab that:
- "spark.master" is set to local[*]
- "spark.app.id" begins with local-xxx
- "spark.submit.deployMode" is set to local
> 2. who has configured this kubernetes cluster? Is it supplied by a cloud
> vendor?
Our Kubernetes cluster was set up in an on-prem environment using RKE2
(https://docs.rke2.io/).
> 3. Confirm that you have configured Spark Connect Server correctly for
> cluster mode. Make sure you specify the cluster manager (e.g., Kubernetes)
> and other relevant Spark configurations in your Spark job submission.
Based on the Spark Connect documentation I've read, there don't seem to be any
cluster-mode-specific settings for the Spark Connect Server.
Configuration - Spark 3.4.1 Documentation
https://spark.apache.org/docs/3.4.1/configuration.html#spark-connect
Quickstart: Spark Connect — PySpark 3.4.1 documentation
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html
Spark Connect Overview - Spark 3.4.1 Documentation
https://spark.apache.org/docs/latest/spark-connect-overview.html
The documentation only suggests running ./sbin/start-connect-server.sh
--packages org.apache.spark:spark-connect_2.12:3.4.0, leaving me at a loss.
> 4. Can you provide a full spark submit command
Given the nature of Spark Connect, I don't use the spark-submit command.
Instead, as per the documentation, I can execute workloads using only a Python
script. For the Spark Connect Server, I have a Kubernetes manifest executing
"/opt/spark/sbin/start-connect-server.sh --packages
org.apache.spark:spark-connect_2.12:3.4.0".
> 5. Make sure that the Python client script connecting to Spark Connect Server
> specifies the cluster mode explicitly, like using --master or --deploy-mode
> flags when creating a SparkSession.
The Spark Connect Server operates as the driver, so it isn't possible to
specify the --master or --deploy-mode flags in the Python client script. If I
try, I encounter a RuntimeError like this:
RuntimeError: Spark master cannot be configured with Spark Connect server;
however, found URL for Spark Connect [sc://.../]
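Consistent with that error, the client side passes only the sc:// endpoint and never a master or deploy mode; those belong to the server side. A minimal sketch (the host and port are hypothetical, and a Spark Connect server must already be reachable there):

```shell
# The Spark Connect client knows nothing about the cluster manager; it only
# needs the sc:// endpoint of a running Spark Connect server.
./bin/pyspark --remote "sc://spark-connect.example.com:15002"

# Equivalently, a plain Python script can pick up the endpoint from the
# SPARK_REMOTE environment variable before SparkSession.builder.getOrCreate():
#   export SPARK_REMOTE="sc://spark-connect.example.com:15002"
#   python my_workload.py
```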
> 6. Ensure that you have allocated th