[ https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601543#comment-17601543 ]
Mars edited comment on SPARK-40320 at 9/7/22 10:50 PM:
-------------------------------------------------------

[~Ngone51]
Shouldn't it bring up a new `receiveLoop()` to serve RPC messages?

Yes, my previous thinking was wrong. I remote-debugged the Executor and found that it did catch the fatal error in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89].

It resubmits `receiveLoop`, and the second time it blocks at [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L69].

This Executor did not initialize successfully the first time, so it didn't send LaunchedExecutor to the Driver (see [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L172]). As a result, the Executor can't launch tasks; related PR: [https://github.com/apache/spark/pull/25964].

Why doesn't SparkUncaughtExceptionHandler catch the fatal error? See [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L284]: `plugins` is a private variable, so it broke when the Executor was first initialized.

was (Author: JIRAUSER290821):

[~Ngone51]
Shouldn't it bring up a new `receiveLoop()` to serve RPC messages?

Yes, my previous thinking was wrong. I remote-debugged the Executor and found that it did catch the fatal error in [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89].
It resubmits `receiveLoop`, and the second time it blocks at [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L69].

This Executor did not initialize successfully the first time and didn't send LaunchedExecutor to the Driver (see [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L172]). As a result, the Executor can't launch tasks; related PR: [https://github.com/apache/spark/pull/25964].

Why doesn't SparkUncaughtExceptionHandler catch the fatal error? See [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L284]: `plugins` is a private variable, so it broke when the Executor was first initialized.

> When the Executor plugin fails to initialize, the Executor shows active but
> does not accept tasks forever, just like being hung
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40320
>                 URL: https://issues.apache.org/jira/browse/SPARK-40320
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 3.0.0
>            Reporter: Mars
>            Priority: Major
>
> *Steps to reproduce:*
> Set `spark.plugins=ErrorSparkPlugin`.
> The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes are as below (I
> abbreviated the code to make it clearer):
> {code:scala}
> import java.util
> import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}
>
> // Registered through `spark.plugins`; supplies the driver- and executor-side plugins.
> class ErrorSparkPlugin extends SparkPlugin {
>   // ErrorDriverPlugin is not shown in this abbreviated report.
>   override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin()
>   override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
> }{code}
> {code:scala}
> class ErrorExecutorPlugin extends ExecutorPlugin {
>   private val checkingInterval: Long = 1
>   // Always throws: UnsatisfiedLinkError is a LinkageError, which Scala's NonFatal treats as fatal.
>   override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): Unit = {
>     if (checkingInterval == 1) {
>       throw new UnsatisfiedLinkError("My Exception error")
>     }
>   }
> }{code}
> The Executor is active
> when we check in the Spark UI, but it is broken and doesn't receive any
> tasks.
> *Root Cause:*
> I checked the code and found that `org.apache.spark.rpc.netty.Inbox#safelyCall`
> throws the fatal error (`UnsatisfiedLinkError` is a fatal error) in the method
> `dealWithFatalError`. The `CoarseGrainedExecutorBackend` JVM process is
> still alive, but the communication thread is no longer working (see
> `MessageLoop#receiveLoopRunnable`: `receiveLoop()` was broken, so the
> executor doesn't receive any messages).
> Some ideas:
> It is very hard to know what happened here without reading the code. The
> Executor is active but can't do anything, so we are left wondering whether
> the Driver is broken or the problem is in the Executor. I think that at least
> the Executor status shouldn't be active here, or the Executor could call
> exitExecutor (kill itself).
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
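
For readers following the thread, the behaviour described in the edited comment (the fatal error is caught, the receive loop is resubmitted, and the new loop then blocks waiting for messages while the Executor was never fully initialized) can be modelled with a small standalone sketch. This is a simplified model and not the actual `MessageLoop` code: every name in it (`ReceiveLoopSketch`, `inbox`, `launched`, `loopStarts`, `initPlugins`, `handle`) is a hypothetical stand-in, and a plain string queue takes the place of Spark's RPC inboxes.

```scala
import java.util.concurrent.{CountDownLatch, Executors, LinkedBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

// Simplified model only: every name here is a hypothetical stand-in, not a Spark API.
object ReceiveLoopSketch {
  val inbox = new LinkedBlockingQueue[String]()
  val launched = new AtomicBoolean(false) // stands in for "LaunchedExecutor was sent"
  val loopStarts = new AtomicInteger(0)   // how many times the loop has been started
  private val secondStart = new CountDownLatch(2)
  private val pool = Executors.newFixedThreadPool(1)

  // Stands in for plugin initialization that fails with a fatal error.
  private def initPlugins(): Unit = throw new UnsatisfiedLinkError("My Exception error")

  private def handle(msg: String): Unit = {
    if (msg == "init") {
      initPlugins()
      launched.set(true) // never reached, because initPlugins() throws
    }
  }

  private val receiveLoop: Runnable = new Runnable {
    override def run(): Unit = {
      loopStarts.incrementAndGet()
      secondStart.countDown()
      try {
        while (true) {
          handle(inbox.take()) // blocks while the queue is empty
        }
      } catch {
        case _: InterruptedException => () // the pool is shutting down
        case t: Throwable =>
          // Resubmit the loop so a live thread remains, then let the error propagate.
          pool.execute(this)
          throw t
      }
    }
  }

  def main(args: Array[String]): Unit = {
    pool.execute(receiveLoop)
    inbox.put("init")
    secondStart.await(5, TimeUnit.SECONDS) // wait until the loop has been resubmitted
    println(s"loop starts = ${loopStarts.get}, launched = ${launched.get}")
    pool.shutdownNow() // interrupts the blocked take() so the JVM can exit
  }
}
```

Running `ReceiveLoopSketch.main` should report that the loop was started twice while `launched` stays false, which mirrors the symptom in the issue: the process and its message loop are alive, but the step that would announce the Executor as ready never ran. The rethrown error also prints a stack trace from the pool thread, since nothing handles it in this sketch.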