[ 
https://issues.apache.org/jira/browse/SPARK-50034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-50034.
---------------------------------
    Fix Version/s: 4.0.0
         Assignee: Mingkang Li
       Resolution: Fixed

> Fix Misreporting of Fatal Errors as Uncaught Exceptions in 
> SparkUncaughtExceptionHandler
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-50034
>                 URL: https://issues.apache.org/jira/browse/SPARK-50034
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Mingkang Li
>            Assignee: Mingkang Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> In {{Executor.scala}}, an exception is considered fatal (as determined by 
> {{isFatalError()}}) if it or any exception in its cause chain is fatal. The 
> {{spark.executor.killOnFatalError.depth}} config limits how deep into the 
> chain this check looks. If a fatal error is found, 
> {{SparkUncaughtExceptionHandler}} is called.
> However, {{SparkUncaughtExceptionHandler}} currently considers only the 
> top-level exception when choosing the exit code, rather than traversing the 
> exception chain to find the actual fatal cause. As a result, fatal errors 
> wrapped in other exceptions, such as an {{OutOfMemoryError}}, are misreported 
> as uncaught exceptions.
> For instance, consider an {{OutOfMemoryError}} wrapped as follows:
> RuntimeException
>  - Caused by: RuntimeException
>  - Caused by: java.lang.OutOfMemoryError
> {{SparkUncaughtExceptionHandler}} would terminate the executor with exit code 
> {{SparkExitCode.UNCAUGHT_EXCEPTION}}, even though the true cause is an OOM.
> This change intends to modify {{SparkUncaughtExceptionHandler}} to:
>  * Inspect the exception chain (up to the configured depth).
>  * Ensure that the actual fatal error is correctly identified and reflected 
> in the exit code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
