[ 
https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YE updated SPARK-44542:
-----------------------
    Attachment: image-2023-07-25-16-46-42-522.png

> eagerly load SparkExitCode class in SparkUncaughtExceptionHandler
> -----------------------------------------------------------------
>
>                 Key: SPARK-44542
>                 URL: https://issues.apache.org/jira/browse/SPARK-44542
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.1.3, 3.3.2, 3.4.1
>            Reporter: YE
>            Priority: Major
>         Attachments: image-2023-07-25-16-46-03-989.png, 
> image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png
>
>
> There are two pieces of background for this improvement proposal:
> 1. When running Spark on YARN, a disk may become corrupted while the 
> application is running. The corrupted disk might hold the Spark jars (the 
> archive cached via spark.yarn.archive). In that case, the executor JVM can 
> no longer load any Spark-related classes.
> 2. Spark uses the OutputCommitCoordinator to avoid data races between 
> speculative tasks, so that no two task attempts can commit the same 
> partition at the same time. In other words, once a task's commit request is 
> granted, all other commit requests are denied until the committing task 
> fails (a simplified sketch of this rule is shown right after this list).
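> A minimal, hypothetical sketch of that coordination rule (not the actual 
> OutputCommitCoordinator implementation; class and method names here are 
> illustrative only):
> {code:scala}
> import scala.collection.mutable
>
> // Simplified model of the rule described above: the first attempt asking to
> // commit a (stage, partition) wins, later attempts are denied, and only a
> // failure of the winning attempt releases the commit right.
> class CommitCoordinatorModel {
>   // (stageId, partitionId) -> task attempt currently holding the commit right
>   private val authorized = mutable.Map[(Int, Int), Long]()
>
>   def canCommit(stageId: Int, partitionId: Int, attemptId: Long): Boolean =
>     synchronized {
>       authorized.get((stageId, partitionId)) match {
>         case None =>
>           authorized((stageId, partitionId)) = attemptId // first requester wins
>           true
>         case Some(winner) =>
>           winner == attemptId // every other attempt keeps getting denied
>       }
>     }
>
>   // Only a failure of the authorized attempt frees the partition again.
>   def attemptFailed(stageId: Int, partitionId: Int, attemptId: Long): Unit =
>     synchronized {
>       if (authorized.get((stageId, partitionId)).contains(attemptId)) {
>         authorized.remove((stageId, partitionId))
>       }
>     }
> }
> {code}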
>  
> We encountered a corner case that combines the two situations above and 
> causes the Spark job to hang. A short timeline is described below:
>  # task 5372 (tid: 21662) starts running at 21:55
>  # around 22:00, the disk containing the Spark archive for that 
> task/executor becomes corrupted, making the archive inaccessible from the 
> executor JVM's perspective
>  # the task keeps running; at 22:05 it requests commit permission from the 
> coordinator and performs the commit
>  # however, due to the corrupted disk, an exception is raised in the 
> executor JVM
>  # the SparkUncaughtExceptionHandler kicks in, but because the jar/disk is 
> corrupted, the handler itself throws an exception and the halt process 
> throws as well
>  # the executor hangs there and runs no more tasks, while the authorized 
> commit request remains valid on the driver side
>  # speculative tasks start to kick in, but since they cannot obtain commit 
> permission, they are all killed/denied
>  # the job hangs until our SRE kills the container from outside
> Some screenshots are provided below.
> !image-2023-07-25-16-46-03-989.png!
> !image-2023-07-25-16-46-28-158.png!
> !image-2023-07-25-16-46-42-522.png!
> For this specific case, I'd like to propose eagerly loading the 
> SparkExitCode class in SparkUncaughtExceptionHandler, so that the halt 
> process can still be executed instead of throwing an exception because 
> SparkExitCode is not loadable in the scenario described above.
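> A hedged sketch of the proposed change (illustrative only, not the final 
> patch; field names are assumptions): referencing the SparkExitCode 
> constants when the handler is constructed forces the JVM to load and 
> initialize the class while the archive on disk is still readable, instead 
> of loading it lazily inside uncaughtException() when the disk may already 
> be corrupted.
> {code:scala}
> package org.apache.spark.util
>
> private[spark] class SparkUncaughtExceptionHandler(
>     exitOnUncaughtException: Boolean = true)
>   extends Thread.UncaughtExceptionHandler {
>
>   // Touching these constants at construction time loads SparkExitCode
>   // eagerly, so the handler can still halt the JVM even if the Spark jars
>   // become unreadable later.
>   private val uncaughtExceptionExitCode = SparkExitCode.UNCAUGHT_EXCEPTION
>   private val oomExitCode = SparkExitCode.OOM
>
>   override def uncaughtException(thread: Thread, exception: Throwable): Unit = {
>     try {
>       if (exception.isInstanceOf[OutOfMemoryError]) {
>         Runtime.getRuntime.halt(oomExitCode)
>       } else if (exitOnUncaughtException) {
>         Runtime.getRuntime.halt(uncaughtExceptionExitCode)
>       }
>     } catch {
>       // If even the handler fails, fall back to a plain halt so the executor
>       // never lingers in the half-dead state described in the timeline.
>       case _: Throwable => Runtime.getRuntime.halt(1)
>     }
>   }
> }
> {code}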



