[ https://issues.apache.org/jira/browse/SPARK-44542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
YE updated SPARK-44542:
-----------------------
    Summary: eagerly load SparkExitCode class in SparkUncaughtExceptionHandler  (was: easily load SparkExitCode class in SparkUncaughtExceptionHandler)

> eagerly load SparkExitCode class in SparkUncaughtExceptionHandler
> -----------------------------------------------------------------
>
>                 Key: SPARK-44542
>                 URL: https://issues.apache.org/jira/browse/SPARK-44542
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.1.3, 3.3.2, 3.4.1
>            Reporter: YE
>            Priority: Major
>         Attachments: image-2023-07-25-16-46-03-989.png, image-2023-07-25-16-46-28-158.png, image-2023-07-25-16-46-42-522.png
>
> There are two pieces of background for this improvement proposal:
> 1. When running Spark on YARN, a disk may become corrupted while the application is running. The corrupted disk may contain the Spark jars (the cached archive from spark.yarn.archive). In that case, the executor JVM can no longer load any Spark-related classes.
> 2. Spark uses the OutputCommitCoordinator to avoid data races between speculative tasks, so that no two tasks can commit the same partition at the same time. In other words, once one task's commit request is authorized, all other commit requests for that partition are denied until the committing task fails.
>
> We encountered a corner case combining the two situations above, which made Spark hang. A short timeline:
> # Task 5372 (TID 21662) starts running at 21:55.
> # The disk containing the Spark archive for that task/executor becomes corrupted, making the archive inaccessible from the executor JVM's perspective; this happened around 22:00.
> # The task continues running. At 22:05, it requests a commit from the coordinator and performs the commit.
> # However, due to the corrupted disk, an exception is raised in the executor JVM.
> # The SparkUncaughtExceptionHandler kicks in; however, as the jar/disk is corrupted, the handler itself throws an exception, and the halt process throws an exception too.
> # The executor hangs and no more tasks run on it, yet the authorized commit request remains valid on the driver side.
> # Speculative tasks start to kick in, but since none of them can obtain the commit permission, all speculative attempts are killed/denied.
> # The job hangs until our SRE kills the container from the outside.
> Some screenshots are provided below.
> !image-2023-07-25-16-46-03-989.png!
> !image-2023-07-25-16-46-28-158.png!
> !image-2023-07-25-16-46-42-522.png!
> For this specific case, I'd like to propose eagerly loading the SparkExitCode class in the SparkUncaughtExceptionHandler, so that the halt process can run rather than throwing an exception because SparkExitCode is not loadable in the scenario above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
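The eager-loading idea proposed above can be sketched in plain JVM terms. This is a minimal illustration, not Spark's actual code: `ExitCode`, `EagerHandler`, and `EagerLoadDemo` are hypothetical names, and the event list exists only to make the loading order observable. The key point is that the handler touches the exit-code class once at construction time, so the JVM initializes it while the jars on disk are still readable; the `uncaughtException` path then needs no further class loading.

```java
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of eager class loading; names are illustrative.

final class ExitCode {
    // The static initializer runs exactly once, when the JVM initializes
    // this class. In the reported scenario, this is the step that fails
    // with NoClassDefFoundError once the jar on disk is corrupted.
    static { EagerLoadDemo.events.add("ExitCode initialized"); }

    // A static method rather than a `static final int` constant: javac
    // inlines compile-time constants, which would load nothing at all.
    static int uncaughtException() { return 50; }
}

final class EagerHandler implements Thread.UncaughtExceptionHandler {
    // Touching ExitCode here forces it to be loaded and initialized at
    // handler construction time, while the classpath is still readable.
    private final int exitCode = ExitCode.uncaughtException();

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // Only the pre-loaded class and a captured int are used here, so
        // a later-corrupted classpath cannot prevent the halt.
        Runtime.getRuntime().halt(exitCode);
    }
}

public class EagerLoadDemo {
    static final List<String> events = new ArrayList<>();

    public static void main(String[] args) {
        events.add("before handler construction");
        Thread.setDefaultUncaughtExceptionHandler(new EagerHandler());
        events.add("after handler construction");
        System.out.println(events);
    }
}
```

Running the demo shows "ExitCode initialized" appearing between the two construction markers, confirming that the class is initialized when the handler is built rather than deferred to the first uncaught exception.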