Oh, this issue is actually pretty straightforward to solve, at least in
Spark 3.5.2.

Just download the `spark-connect` Maven jar, place it in
`$SPARK_HOME/jars`, and then rebuild the Docker image. I see that I had
posted a comment on the JIRA as well. This is how I was able to fix it for
a standalone cluster, at least.
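
A minimal sketch of that step, run before rebuilding the image (the Spark
3.5.2 / Scala 2.12 artifact coordinates are assumptions; adjust them to
your build):

  # Fetch the Spark Connect server jar from Maven Central into Spark's classpath
  curl -fsSL -o "$SPARK_HOME/jars/spark-connect_2.12-3.5.2.jar" \
    https://repo1.maven.org/maven2/org/apache/spark/spark-connect_2.12/3.5.2/spark-connect_2.12-3.5.2.jar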

On Mon, Sep 9, 2024 at 7:04 PM Nagatomi Yasukazu <yassan0...@gmail.com>
wrote:

> Hi Prabodh,
>
> Thank you for your response.
>
> As you can see from the following JIRA issue, it is possible to run the
> Spark Connect Driver on Kubernetes:
>
> https://issues.apache.org/jira/browse/SPARK-45769
>
> However, this issue describes a problem that occurs when the Driver and
> Executors are running on different nodes. This could potentially be the
> reason why only Standalone mode is currently supported, but I am not
> certain about it.
>
> Thank you for your attention.
>
>
> On Mon, Sep 9, 2024 at 12:40 Prabodh Agarwal <prabodh1...@gmail.com> wrote:
>
>> My 2 cents from my experience using Spark Connect in cluster mode.
>>
>> 1. Create a Spark cluster of two or more nodes: one node as the master
>> and the others as workers. Then deploy Spark Connect pointing at the
>> master node (see the first sketch below). This works well. The approach
>> is not well documented, but I was able to figure it out by trial and
>> error.
>> 2. On Kubernetes, we can get the executors to run on the cluster itself
>> by default (see the second sketch below). That part is straightforward,
>> but the driver continues to run on a local machine. I agree that making
>> the driver run on k8s as well would be slick.
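>>
>> For (1), a minimal sketch of launching the Connect server against the
>> standalone master (the host name and version are assumptions):
>>
>>   $SPARK_HOME/sbin/start-connect-server.sh \
>>     --master spark://<master-host>:7077 \
>>     --packages org.apache.spark:spark-connect_2.12:3.5.2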
>>
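>> For (2), a sketch of the client-mode variant, where executors run as
>> pods while the driver stays on the submitting machine (the API server
>> URL and container image are placeholders):
>>
>>   $SPARK_HOME/sbin/start-connect-server.sh \
>>     --master k8s://https://<api-server>:6443 \
>>     --conf spark.kubernetes.container.image=<spark-image> \
>>     --packages org.apache.spark:spark-connect_2.12:3.5.2
>>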
>> Thank you.
>>
>>
>> On Mon, Sep 9, 2024 at 6:17 AM Nagatomi Yasukazu <yassan0...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> Why is it not possible to specify cluster as the deploy mode for Spark
>>> Connect?
>>>
>>> As discussed in the following thread, it appears that there is an
>>> "arbitrary decision" within spark-submit that "Cluster mode is not
>>> applicable" to Spark Connect.
>>>
>>> GitHub Issue Comment:
>>>
>>> https://github.com/kubeflow/spark-operator/issues/1801#issuecomment-2000494607
>>>
>>> > This will circumvent the submission error you may have gotten if you
>>> tried to just run the SparkConnectServer directly. From my investigation,
>>> that looks to be an arbitrary decision within spark-submit that Cluster
>>> mode is "not applicable" to SparkConnect. Which is sort of true except when
>>> using this operator :)
>>>
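>>> For reference, a sketch of the kind of submission that gets rejected
>>> (the fully qualified class name is my understanding of the Connect
>>> server entry point; the master URL and jar path are placeholders):
>>>
>>>   spark-submit \
>>>     --master k8s://https://<api-server>:6443 \
>>>     --deploy-mode cluster \
>>>     --class org.apache.spark.sql.connect.service.SparkConnectServer \
>>>     local:///opt/spark/jars/spark-connect_2.12-3.5.2.jar
>>>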
>>> I have reviewed the following commit and pull request, but I could not
>>> find any discussion or reason explaining why cluster mode is not available:
>>>
>>> Related Commit:
>>>
>>> https://github.com/apache/spark/commit/11260310f65e1a30f6b00b380350e414609c5fd4
>>>
>>> Related Pull Request:
>>> https://github.com/apache/spark/pull/39928
>>>
>>> This restriction poses a significant obstacle when trying to use Spark
>>> Connect with the Spark Operator. If there is a technical reason for this, I
>>> would like to know more about it. Additionally, if this issue is being
>>> tracked on JIRA or elsewhere, I would appreciate it if you could provide a
>>> link.
>>>
>>> Thank you in advance.
>>>
>>> Best regards,
>>> Yasukazu Nagatomi
>>>
>>
