"Premature end of Content-Length" Error

2023-10-19 Thread Sandhya Bala
Hi all,

I am running into the following error with Spark 2.4.8:

> Job aborted due to stage failure: Task 9 in stage 2.0 failed 4 times, most
> recent failure: Lost task 9.3 in stage 2.0 (TID 100, 10.221.8.73, executor
> 2): org.apache.http.ConnectionClosedException: Premature end of
> Content-Length delimited message body (expected: 106327; received: 3477


but my current code includes neither hadoop-aws nor aws-java-sdk.

Any suggestions on what could be the problem?

Thanks,
Sandhya


Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
Hi,
I have found three important classes:
- org.apache.spark.sql.connect.service.SparkConnectServer : the 
./sbin/start-connect-server.sh script uses SparkConnectServer as its main 
class. In its main function it creates a local session with 
SparkSession.builder.getOrCreate() and starts the SparkConnectService.
- org.apache.spark.sql.connect.SparkConnectPlugin : to enable Spark Connect, 
simply make sure that the appropriate JAR is available in the CLASSPATH and 
that the driver plugin is configured to load this class.
- org.apache.spark.sql.connect.SimpleSparkConnectService : a simple main class 
to start the Spark Connect server as a service for client tests. 

   So, I believe that by configuring spark.plugins and starting the Spark 
cluster on Kubernetes, clients can use sc://ip:port to connect to the 
remote server. 
   Let me give it a try.
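
For concreteness, this is roughly what I have in mind (the submit flags, 
package version, and port below are my assumptions, not something I have 
verified yet):

```
# Sketch only: run an ordinary Spark application on Kubernetes and let the
# driver plugin (spark.plugins) start the Spark Connect service on the driver.
./bin/spark-submit \
  --master k8s://https://<api-server>:<port> \
  --deploy-mode cluster \
  --packages org.apache.spark:spark-connect_2.12:3.4.0 \
  --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin \
  <your-application>

# Clients would then connect to the driver pod (or a Service in front of it)
# at sc://<driver-ip>:15002 (15002 is the default Spark Connect port).
```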



eabour
 
From: eab...@163.com
Date: 2023-10-19 14:28
To: Nagatomi Yasukazu; user @spark
Subject: Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes
Hi all, 
Has the functionality to run the Spark Connect server on k8s been implemented?



 
From: Nagatomi Yasukazu
Date: 2023-09-05 17:51
To: user
Subject: Re: Running Spark Connect Server in Cluster Mode on Kubernetes
Dear Spark Community,

I've been exploring the capabilities of the Spark Connect Server and 
encountered an issue when trying to launch it in a cluster deploy mode with 
Kubernetes as the master.

While initiating the `start-connect-server.sh` script with the `--conf` 
parameter for `spark.master` and `spark.submit.deployMode`, I was met with an 
error message:

```
Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode 
is not applicable to Spark Connect server.
```
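
For reference, my invocation was of roughly this shape (the values below are 
placeholders, not the exact ones I used):

```
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.4.0 \
  --conf spark.master=k8s://https://<api-server>:<port> \
  --conf spark.submit.deployMode=cluster
```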

This error message can be traced back to Spark's source code here:
https://github.com/apache/spark/blob/6c885a7cf57df328b03308cff2eed814bda156e4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L307

Given my observations, I'm curious about the Spark Connect Server roadmap:

Is there a plan or current conversation to enable Kubernetes as a master in 
Spark Connect Server's cluster deploy mode?

I have tried to gather information from existing JIRA tickets, but have not 
been able to get a definitive answer:

https://issues.apache.org/jira/browse/SPARK-42730
https://issues.apache.org/jira/browse/SPARK-39375
https://issues.apache.org/jira/browse/SPARK-44117

Any thoughts, updates, or references to similar conversations or initiatives 
would be greatly appreciated.

Thank you for your time and expertise!

Best regards,
Yasukazu

Tue, Sep 5, 2023 12:09 Nagatomi Yasukazu :
Hello Mich,
Thank you for your questions. Here are my responses:

> 1. What investigation have you done to show that it is running in local mode?

I have verified through the History Server's Environment tab that:
- "spark.master" is set to local[*]
- "spark.app.id" begins with local-xxx
- "spark.submit.deployMode" is set to local


> 2. who has configured this kubernetes cluster? Is it supplied by a cloud 
> vendor?

Our Kubernetes cluster was set up in an on-prem environment using RKE2 
(https://docs.rke2.io/).


> 3. Confirm that you have configured Spark Connect Server correctly for 
> cluster mode. Make sure you specify the cluster manager (e.g., Kubernetes) 
> and other relevant Spark configurations in your Spark job submission.

Based on the Spark Connect documentation I've read, there doesn't seem to be 
any specific settings for cluster mode related to the Spark Connect Server.

Configuration - Spark 3.4.1 Documentation
https://spark.apache.org/docs/3.4.1/configuration.html#spark-connect

Quickstart: Spark Connect — PySpark 3.4.1 documentation
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html

Spark Connect Overview - Spark 3.4.1 Documentation
https://spark.apache.org/docs/latest/spark-connect-overview.html

The documentation only suggests running ./sbin/start-connect-server.sh 
--packages org.apache.spark:spark-connect_2.12:3.4.0, leaving me at a loss.


> 4. Can you provide a full spark submit command

Given the nature of Spark Connect, I don't use the spark-submit command. 
Instead, as per the documentation, I can execute workloads using only a Python 
script. For the Spark Connect Server, I have a Kubernetes manifest executing 
"/opt.spark/sbin/start-connect-server.sh --packages 
org.apache.spark:spark-connect_2.12:3.4.0".


> 5. Make sure that the Python client script connecting to Spark Connect Server 
> specifies the cluster mode explicitly, like using --master or --deploy-mode 
> flags when creating a SparkSession.

The Spark Connect Server operates as a Driver, so it isn't possible to specify 
the --master or --deploy-mode flags in the Python client script. If I try, I 
encounter a RuntimeError.

like this:
RuntimeError: Spark master cannot be configured with Spark Connect server; 
however, found URL for Spark Connect [sc://.../]
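
For illustration, the quickstart-style way to point a client at the server is 
just the Connect URL, with no --master or --deploy-mode at all (the host below 
is a placeholder; 15002 is the default Connect port):

```
# The client only needs the sc:// URL of the Spark Connect endpoint
./bin/pyspark --remote "sc://<connect-server-host>:15002"
```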


> 6. Ensure that you have allocated th