[ https://issues.apache.org/jira/browse/SPARK-46702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mohamad Haidar updated SPARK-46702:
-----------------------------------
    Attachment: image-2024-01-12-10-45-30-398.png

Spark Cluster Crashing
----------------------

Key: SPARK-46702
URL: https://issues.apache.org/jira/browse/SPARK-46702
Project: Spark
Issue Type: Bug
Components: Spark Core, Spark Docker
Affects Versions: 3.4.0, 3.5.0
Reporter: Mohamad Haidar
Priority: Blocker
Attachments: CV62A4~1.LOG, cveshv-events-streaming-TRACE (2).zip, cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log, image-2024-01-12-10-44-45-717.png, image-2024-01-12-10-45-18-905.png, image-2024-01-12-10-45-30-398.png, image-2024-01-12-10-45-40-397.png, image-2024-01-12-10-45-50-427.png, logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log

h3. Description:
# We have a Spark cluster installed on a Kubernetes cluster with one driver and multiple executors (120).
# The batch duration is configured to 30 seconds.
# The Spark cluster reads from a 120-partition Kafka topic and writes to an hourly index in Elasticsearch.
# Elasticsearch has 30 data nodes, with 1 shard per data node for each index.
# The configuration of the driver StatefulSet (STS) is in the Appendix; an illustrative spark-submit sketch follows the Conclusion below.
# The driver is observed restarting periodically every 10 minutes. The restarts do not always occur, but when they do happen, they happen every 10 minutes.
# The restart frequency increases as the throughput increases.
# When the restarts happen, we see OptionalDataException; the attached "logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log" is the log leading up to a restart of the driver.

h3. Analysis:
# We ran a test at 250 K records/second, and processing times were good, between 15 and 20 seconds.
# We were able to avoid all the restarts simply by disabling the liveness checks.
# This resulted in NO RESTARTS of Streaming Core. We tried the above with two scenarios:
* Speculation disabled --> After 10 to 20 minutes the batch duration increased to minutes and processing eventually became very slow. During this period, the main error logs observed were about {*}The executor with id 7 exited with exit code 50 (Uncaught exception).{*} Logs at WARN level and TRACE level were collected:
* {*}WARN{*}: Logs attached "cveshv-events-streaming-core-cp-type2-filter-driver-0_liveness_300000_failed_120124_0336_2.log"
* {*}TRACE{*}: Logs attached ""
* Speculation enabled --> the batch duration increased to minutes (a large lag) only after around 2 hours; the related log is "cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log".

h3. Conclusion:
* The liveness check is failing and is therefore causing the restarts.
* The logs indicate that there are some unhandled exceptions in the executors.
* The issue may lie elsewhere as well. The liveness check that was disabled, and that was initially causing the restarts every 10 minutes after 3 failures, is shown below, after the deployment sketch.
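For context, here is a minimal spark-submit sketch of how a deployment like the one described above might be launched on Kubernetes. The master URL, container image, application name, main class, application jar, and the Elasticsearch connector version are placeholders/assumptions, not taken from this report; the Kafka connector coordinate is only an example. Only the executor count (120) and the speculation flag reflect the configuration discussed here.

{code:bash}
# Illustrative only: every <...> value is a placeholder and must be replaced with the real value.
# spark.executor.instances=120 matches the executor count described above;
# spark.speculation toggles the speculation scenario discussed in the Analysis.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name <app-name> \
  --class <MainStreamingClass> \
  --conf spark.executor.instances=120 \
  --conf spark.speculation=false \
  --conf spark.kubernetes.container.image=<spark-image> \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,org.elasticsearch:elasticsearch-spark-30_2.12:<es-hadoop-version> \
  local:///opt/spark/app/<application>.jar
{code}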
{code:bash}
#!/usr/bin/env bash
# Liveness check (the script that was disabled): fail the probe if any Spark job has failed.
spark_application_id=$(curl -s localhost:4040/api/v1/applications | jq '.[0] | ."id"')   # application ID extracted from the JSON returned by the driver's REST API
spark_application_id_formatted=$(echo "$spark_application_id" | sed 's/^.//' | sed 's/.$//')   # same ID with the first and last characters removed, which are the (") quotes
spark_failed_job_queue_length=$(curl -s "localhost:4040/api/v1/applications/$spark_application_id_formatted/jobs?status=failed" | jq length)   # number of failed jobs reported by the driver
if [ "$spark_failed_job_queue_length" -eq "0" ]; then   # check whether the failed-jobs count is zero
  exit 0   # no failed Spark jobs: exit code 0, indicating success
else
  exit 1   # one or more failed Spark jobs: exit code 1, indicating failure
fi
{code}

h3. Next Action:
* Please help us identify the root cause of the issue. We have tried many configurations, with two different Spark versions (3.4 and 3.5), and we are not able to avoid the issue.

Appendix:

!image-2024-01-12-10-37-50-117.png!
!image-2024-01-12-10-38-22-245.png!
!image-2024-01-12-10-38-15-835.png!
!image-2024-01-12-10-38-34-247.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org