[ https://issues.apache.org/jira/browse/SPARK-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brett Randall updated SPARK-15685:
----------------------------------
    Description: 
Spark, when running in local mode, can encounter certain types of {{Error}} 
exceptions in developer-code or third-party libraries and call 
{{System.exit()}}, potentially killing a long-running JVM/service.  The caller 
should decide on the exception-handling and whether the error should be deemed 
fatal.

*Consider this scenario:*

* Spark is being used in local master mode within a long-running JVM 
microservice, e.g. a Jetty instance.
* A task is run.  The task errors with particular types of unchecked throwables:
** a) there is some bad code and/or bad data that exposes a bug involving 
unterminated recursion, leading to a {{StackOverflowError}} (a minimal sketch 
of this scenario follows the list), or
** b) a rarely-used function is called - owing to a packaging error in the 
service, a third-party library is missing some dependencies and a 
{{NoClassDefFoundError}} is thrown.
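
For illustration only, here is a hedged sketch of scenario (a); the names and 
structure are assumptions, not the linked test-case.  A trivial local-mode job 
runs a task that recurses without terminating, so a {{StackOverflowError}} is 
thrown inside the (local-mode) executor thread:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hedged repro sketch for scenario (a); assumed code, not the linked test-case.
// The task recurses without terminating, so a StackOverflowError is thrown
// inside the local-mode executor thread.
object StackOverflowRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SPARK-15685-repro")
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(Seq(1)).map { _ =>
        // Non-tail recursion: guaranteed to exhaust the stack.
        def recurse(n: Long): Long = 1 + recurse(n + 1)
        recurse(0)
      }.count()
    } finally {
      sc.stop()
    }
  }
}
{code}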

*Expected behaviour:* Since we are running in local mode, we might expect an 
unchecked exception to be thrown, which the Spark caller can optionally handle. 
 In the case of Jetty, a request thread or some other background worker thread 
might or might not handle the exception; the thread might exit or simply log an 
error.  The caller should decide how the error is handled.
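
To make that expectation concrete, here is a hedged sketch (hypothetical 
service code, not from this issue) of how a caller would like to be able to 
handle such a failure itself:

{code}
import org.apache.spark.SparkContext
import org.slf4j.LoggerFactory

// Hypothetical request-handler helper: run a small job on a local-mode
// SparkContext and let the *caller* decide what a failure means, instead of
// having the whole service JVM exited from underneath it.
object LocalSparkJob {
  private val logger = LoggerFactory.getLogger(getClass)

  def wordCount(sc: SparkContext, inputPath: String): Either[Throwable, Long] =
    try {
      Right(sc.textFile(inputPath).flatMap(_.split("\\s+")).count())
    } catch {
      case t: Throwable =>
        // e.g. log and return an error response, but keep the JVM alive.
        logger.error("Spark job failed; keeping the service running", t)
        Left(t)
    }
}
{code}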

*Actual behaviour:* {{System.exit()}} is called, the JVM exits and the Jetty 
microservice is down and must be restarted.

*Consequence:* Any local code or third-party library might cause Spark to exit 
the long-running JVM/microservice, so Spark can be a problem in this 
architecture.  I have seen this now on three separate occasions, leading to 
service-down bug reports.

*Analysis:*

The line of code that seems to be the problem is: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L405

{code}
// Don't forcibly exit unless the exception was inherently fatal, to avoid
// stopping other tasks unnecessarily.
if (Utils.isFatalError(t)) {
    SparkUncaughtExceptionHandler.uncaughtException(t)
}
{code}

[Utils.isFatalError()|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1818]
 first excludes anything matched by Scala's 
[NonFatal|https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/control/NonFatal.scala#L31],
 which matches every throwable except {{VirtualMachineError}}, {{ThreadDeath}}, 
{{InterruptedException}}, {{LinkageError}} and {{ControlThrowable}}.  
{{Utils.isFatalError()}} additionally excludes {{InterruptedException}}, 
{{NotImplementedError}} and {{ControlThrowable}}.
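
Paraphrased as a standalone sketch (not the exact Spark source), the 
classification works roughly like this:

{code}
import scala.util.control.{ControlThrowable, NonFatal}

object FatalErrorSketch {
  // Rough paraphrase of Utils.isFatalError (a sketch, not the exact Spark
  // source).  Anything matched by NonFatal, plus the extra exclusions below,
  // is treated as non-fatal; everything else (VirtualMachineError, ThreadDeath,
  // LinkageError, ...) is deemed "fatal" and handed to
  // SparkUncaughtExceptionHandler, which calls System.exit().
  def isFatalError(t: Throwable): Boolean = t match {
    case NonFatal(_) | _: InterruptedException | _: NotImplementedError | _: ControlThrowable =>
      false
    case _ =>
      true
  }
}
{code}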

That leaves {{Error}}s such as {{StackOverflowError}} (which extends 
{{VirtualMachineError}}) or {{NoClassDefFoundError}} (which extends 
{{LinkageError}}), exactly the throwables produced in the scenarios above.  
For these, {{SparkUncaughtExceptionHandler.uncaughtException()}} proceeds to 
call {{System.exit()}}.

[Further 
up|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L77]
 in {{Executor}} we can see that registration of 
{{SparkUncaughtExceptionHandler}} is already skipped in local mode:

{code}
  if (!isLocal) {
    // Setup an uncaught exception handler for non-local mode.
    // Make any thread terminations due to uncaught exceptions kill the entire
    // executor process to avoid surprising stalls.
    Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler)
  }
{code}

The same exclusion should also apply to "fatal" errors in local mode - a 
local-mode caller cannot afford to have the enclosing JVM (e.g. Jetty) shut 
down; the caller should decide how the error is handled.
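
One possible shape of such a change (an assumption about the fix, not a 
committed patch) would be to guard the forced exit with the same {{isLocal}} 
flag used at registration time:

{code}
// Sketch only (not a committed patch): skip the forced exit in local mode,
// mirroring the !isLocal guard used when the handler is registered above.
if (!isLocal && Utils.isFatalError(t)) {
  SparkUncaughtExceptionHandler.uncaughtException(t)
}
{code}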

A minimal test-case is supplied.  It installs a logging {{SecurityManager}} to 
confirm that {{System.exit()}} was called from 
{{SparkUncaughtExceptionHandler.uncaughtException}} via {{Executor}}.  It also 
hints at the workaround - install your own {{SecurityManager}} and inspect the 
current stack in {{checkExit()}} to prevent Spark from exiting the JVM.
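
For reference, a hedged sketch of that workaround (class name and details are 
assumptions; the linked test-case may differ):

{code}
import java.security.Permission

// Sketch of the workaround (assumed names/details, not the linked test-case):
// a SecurityManager whose checkExit() refuses System.exit() when the call
// originates from SparkUncaughtExceptionHandler, keeping the service JVM up.
class NoSparkExitSecurityManager extends SecurityManager {
  override def checkExit(status: Int): Unit = {
    val fromSparkHandler = Thread.currentThread().getStackTrace.exists { frame =>
      frame.getClassName.contains("SparkUncaughtExceptionHandler")
    }
    if (fromSparkHandler) {
      throw new SecurityException(s"Blocked System.exit($status) from Spark's uncaught-exception handler")
    }
  }

  // Allow everything else so the rest of the service is unaffected.
  override def checkPermission(perm: Permission): Unit = ()
}

// Installed once at service start-up:
//   System.setSecurityManager(new NoSparkExitSecurityManager)
{code}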

Test-case: https://github.com/javabrett/SPARK-15685 .

> StackOverflowError (VirtualMachineError) or NoClassDefFoundError 
> (LinkageError) should not System.exit() in local mode
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15685
>                 URL: https://issues.apache.org/jira/browse/SPARK-15685
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Brett Randall
>            Priority: Critical
>


