[
https://issues.apache.org/jira/browse/SPARK-46702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mohamad Haidar updated SPARK-46702:
-----------------------------------
Attachment:
cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log
> Spark Cluster Crashing
> ----------------------
>
> Key: SPARK-46702
> URL: https://issues.apache.org/jira/browse/SPARK-46702
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Spark Docker
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Mohamad Haidar
> Priority: Blocker
> Attachments: CV62A4~1.LOG,
> cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log,
> logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log
>
> h3. Description:
> h3. 1. We have a Spark cluster deployed on a k8s cluster with one
> driver and multiple executors (120).
> h3. 2. We configure our batch duration to 30 seconds.
> h3. 3. The Spark cluster reads from a 120-partition Kafka topic
> and writes to an hourly index in ElasticSearch.
> h3. 4. ES has 30 DataNodes, 1 shard per DataNode for each index.
> h3. 5. Configuration of Driver STS is in Appendix.
> h3. 6. The driver is observed restarting periodically. The restarts
> do not occur all the time, but when they do occur, they recur every
> 10 minutes.
> h3. 7. The restart frequency increases as the throughput
> increases.
> h3. 8. When the restarts are happening, we see OptionalDataException,
> attached “” is the log resulting in a restart of the driver.
> h3. Analysis:
> # We ran a test at 250 K records/second, and processing was good, at
> between 15 and 20 seconds.
> # We were able to avoid all the restarts by simply disabling liveness checks.
> # This resulted in NO RESTARTS to Streaming Core. We tried the above with
> two scenarios:
> * Speculation Disabled --> After 10 to 20 minutes the batch duration
> increased to minutes and eventually processing became very slow. During this
> time, the main error logs observed were about {*}The executor with id 7 exited
> with exit code 50(Uncaught exception).{*} Logs at WARN level and TRACE level
> were collected:
> * {*}WARN{*}: Logs attached “”
> * {*}TRACE{*}: Logs attached “”
> * Speculation Enabled --> The batch duration increased to minutes (big lag)
> only after around 2 hours; the related logs are “”.
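To put the test figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes that the 250 K records/second is the aggregate rate across the whole topic, that records are spread evenly over the 120 partitions, and that "15 to 20 seconds" is the processing time of one 30-second batch. None of these assumptions is confirmed by the report, and the function names are made up for illustration.

```python
def batch_load(records_per_second: int, batch_seconds: int, partitions: int):
    """Return (records per batch, records per partition per batch).

    Assumes the rate is the total across the topic and is evenly distributed.
    """
    records_per_batch = records_per_second * batch_seconds
    return records_per_batch, records_per_batch / partitions


def utilization(processing_seconds: float, batch_seconds: float) -> float:
    """Fraction of the batch interval spent processing one batch."""
    return processing_seconds / batch_seconds


total, per_partition = batch_load(250_000, 30, 120)
print(total, per_partition)   # 7500000 62500.0
print(utilization(15, 30))    # 0.5
print(utilization(20, 30))
```

Under these assumptions each batch carries 7.5 million records, and a processing time of 15 to 20 seconds uses between half and two thirds of the batch interval, so the job keeps up with the input until the batch duration starts to grow.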
> h3. Conclusion:
> * The liveness check is failing and thus causing the restarts.
> * The logs indicate that there are some unhandled exceptions in the executors.
> * The issue may lie somewhere else as well. Below is the liveness check that
> was disabled and that was initially causing the restarts every 10 minutes,
> after 3 occurrences.
> # This variable stores the application ID, extracted from the JSON data
> # received from the Spark consumer itself.
> spark_application_id=$(curl localhost:4040/api/v1/applications | jq '.[0] | ."id"')
> # This variable stores the formatted application ID, with the first and
> # last characters, which are (") quotes, removed.
> spark_application_id_formatted=$(echo $spark_application_id | sed 's/^.//' | sed 's/.$//')
> # This variable stores the length of the queue of failed operations
> # received from the Spark consumer itself.
> spark_failed_job_queue_length=$(curl localhost:4040/api/v1/applications/$spark_application_id_formatted/jobs?status=failed | jq length)
> # Check whether the failed Spark jobs queue length is zero.
> if [ "$spark_failed_job_queue_length" -eq "0" ]; then
>   # If the length of failed Spark jobs is zero, return exit code 0,
>   # indicating success.
>   exit 0
> else
>   # If the length of failed Spark jobs is greater than zero, return exit
>   # code 1, indicating failure.
>   exit 1
> fi
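For readers who want to experiment with the probe logic without curl and jq, the check above can be sketched in Python roughly as follows. This is an illustrative restatement under assumptions, not the reporter's code: `fetch_json` and `liveness_exit_code` are hypothetical names, the base URL is taken from the script, and returning failure when the endpoint cannot be reached is a choice made for this sketch.

```python
import json
from typing import Callable
from urllib.request import urlopen

# Driver REST endpoint used by the quoted script.
BASE_URL = "http://localhost:4040/api/v1"


def fetch_json(url: str, timeout: float = 5.0):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def liveness_exit_code(fetch: Callable[[str], object] = fetch_json) -> int:
    """Return 0 if the first listed application has no failed jobs, else 1.

    Any error while querying the API is also reported as 1
    (an assumption of this sketch).
    """
    try:
        apps = fetch(f"{BASE_URL}/applications")
        app_id = apps[0]["id"]  # equivalent of jq '.[0] | ."id"'
        failed = fetch(f"{BASE_URL}/applications/{app_id}/jobs?status=failed")
        return 0 if len(failed) == 0 else 1
    except Exception:
        return 1
```

A probe wrapper could end with `sys.exit(liveness_exit_code())`. Like the script, this reports failure as soon as the API lists a single failed job, whether or not the driver process itself is responsive.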
> h3. Next Action:
> * Please help us identify the root cause (RC) of the issue. We’ve tried many
> configurations with 2 different Spark versions, 3.4 and 3.5, and we’re not
> able to avoid the issue.
>
> Appendix:
>
> !image-2024-01-12-10-37-50-117.png!
> !image-2024-01-12-10-38-22-245.png!
> !image-2024-01-12-10-38-15-835.png!
> !image-2024-01-12-10-38-34-247.png!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)