[ https://issues.apache.org/jira/browse/SPARK-46702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mohamad Haidar updated SPARK-46702:
-----------------------------------
    Attachment: image-2024-01-12-10-45-30-398.png

Spark Cluster Crashing
----------------------

Key: SPARK-46702
URL: https://issues.apache.org/jira/browse/SPARK-46702
Project: Spark
Issue Type: Bug
Components: Spark Core, Spark Docker
Affects Versions: 3.4.0, 3.5.0
Reporter: Mohamad Haidar
Priority: Blocker
Attachments: CV62A4~1.LOG, cveshv-events-streaming-TRACE (2).zip, cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log, image-2024-01-12-10-44-45-717.png, image-2024-01-12-10-45-18-905.png, image-2024-01-12-10-45-30-398.png, image-2024-01-12-10-45-40-397.png, image-2024-01-12-10-45-50-427.png, logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log

h3. Description:
# We have a Spark cluster installed on a Kubernetes cluster with one driver and multiple executors (120).
# The batch duration is configured to 30 seconds.
# The Spark cluster reads from a 120-partition Kafka topic and writes to an hourly index in Elasticsearch.
# Elasticsearch has 30 data nodes, with 1 shard per data node for each index.
# The configuration of the driver StatefulSet (STS) is in the Appendix; an illustrative spark-submit sketch follows the Conclusion below.
# The driver is observed restarting periodically every 10 minutes. The restarts do not always occur, but when they do happen, they happen every 10 minutes.
# The restart frequency increases as the throughput increases.
# When the restarts happen, we see OptionalDataException; the attached "logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log" is the log leading up to a restart of the driver.

h3. Analysis:
# We ran a test at 250 K records/second, and processing times were good, between 15 and 20 seconds.
# We were able to avoid all the restarts simply by disabling the liveness checks.
# This resulted in NO RESTARTS of Streaming Core. We tried the above with two scenarios:
* Speculation disabled --> After 10 to 20 minutes the batch duration increased to minutes and processing eventually became very slow. During this period, the main error logs observed were about {*}The executor with id 7 exited with exit code 50 (Uncaught exception).{*} Logs at WARN level and TRACE level were collected:
* {*}WARN{*}: Logs attached "cveshv-events-streaming-core-cp-type2-filter-driver-0_liveness_300000_failed_120124_0336_2.log"
* {*}TRACE{*}: Logs attached ""
* Speculation enabled --> the batch duration increased to minutes (a large lag) only after around 2 hours; the related log is "cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log".

h3. Conclusion:
* The liveness check is failing and is therefore causing the restarts.
* The logs indicate that there are some unhandled exceptions in the executors.
* The issue may lie elsewhere as well. The liveness check that was disabled, and that was initially causing the restarts every 10 minutes after 3 failures, is shown below, after the deployment sketch.
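For context, here is a minimal spark-submit sketch of how a deployment like the one described above might be launched on Kubernetes. The master URL, container image, application name, main class, application jar, and the Elasticsearch connector version are placeholders/assumptions, not taken from this report; the Kafka connector coordinate is only an example. Only the executor count (120) and the speculation flag reflect the configuration discussed here.

{code:bash}
# Illustrative only: every <...> value is a placeholder and must be replaced with the real value.
# spark.executor.instances=120 matches the executor count described above;
# spark.speculation toggles the speculation scenario discussed in the Analysis.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name <app-name> \
  --class <MainStreamingClass> \
  --conf spark.executor.instances=120 \
  --conf spark.speculation=false \
  --conf spark.kubernetes.container.image=<spark-image> \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,org.elasticsearch:elasticsearch-spark-30_2.12:<es-hadoop-version> \
  local:///opt/spark/app/<application>.jar
{code}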
{code:bash}
#!/usr/bin/env bash
# Liveness check (the script that was disabled): fail the probe if any Spark job has failed.
spark_application_id=$(curl -s localhost:4040/api/v1/applications | jq '.[0] | ."id"')   # application ID extracted from the JSON returned by the driver's REST API
spark_application_id_formatted=$(echo "$spark_application_id" | sed 's/^.//' | sed 's/.$//')   # same ID with the first and last characters removed, which are the (") quotes
spark_failed_job_queue_length=$(curl -s "localhost:4040/api/v1/applications/$spark_application_id_formatted/jobs?status=failed" | jq length)   # number of failed jobs reported by the driver
if [ "$spark_failed_job_queue_length" -eq "0" ]; then   # check whether the failed-jobs count is zero
  exit 0   # no failed Spark jobs: exit code 0, indicating success
else
  exit 1   # one or more failed Spark jobs: exit code 1, indicating failure
fi
{code}

h3. Next Action:
* Please help us identify the root cause of the issue. We have tried many configurations, with two different Spark versions (3.4 and 3.5), and we are not able to avoid the issue.

Appendix:

!image-2024-01-12-10-37-50-117.png!
!image-2024-01-12-10-38-22-245.png!
!image-2024-01-12-10-38-15-835.png!
!image-2024-01-12-10-38-34-247.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org