[jira] [Commented] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung

2022-09-22 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608268#comment-17608268
 ] 

wuyi commented on SPARK-40320:
--

I see. Thanks for the explanation.

> When the Executor plugin fails to initialize, the Executor shows active but 
> does not accept tasks forever, just like being hung
> ---
>
> Key: SPARK-40320
> URL: https://issues.apache.org/jira/browse/SPARK-40320
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 3.0.0
> Reporter: Mars
> Priority: Major
>
> *Steps to reproduce:*
> Set `spark.plugins=ErrorSparkPlugin`.
> The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes are shown below (the 
> code is abbreviated for clarity):
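> {code:java}
> import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}
>
> class ErrorSparkPlugin extends SparkPlugin {
>   // ErrorDriverPlugin is not shown in this abbreviated reproduction.
>   override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin()
>   override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
> }{code}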
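> {code:java}
> import java.util
> import org.apache.spark.api.plugin.{ExecutorPlugin, PluginContext}
>
> class ErrorExecutorPlugin extends ExecutorPlugin {
>   private val checkingInterval: Long = 1
>   // The condition is always true, so init always throws a fatal error.
>   override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): Unit = {
>     if (checkingInterval == 1) {
>       throw new UnsatisfiedLinkError("My Exception error")
>     }
>   }
> } {code}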
> The Executor shows as active in the Spark UI, but it is broken and 
> never receives any tasks.
> *Root Cause:*
> Reading the code, I found that `org.apache.spark.rpc.netty.Inbox#safelyCall` 
> rethrows the fatal error (`UnsatisfiedLinkError` is a fatal error) in the method 
> `dealWithFatalError`. The `CoarseGrainedExecutorBackend` JVM 
> process is still alive, but the communication thread is no longer working 
> (see `MessageLoop#receiveLoopRunnable`: `receiveLoop()` is broken, 
> so the executor doesn't receive any messages).
> Some ideas:
> It is very hard to tell what happened here without reading the 
> code. The Executor is active but can't do anything, so we are left wondering 
> whether the driver or the Executor is at fault. I think that, at the very least, 
> the Executor status shouldn't be active here, or the Executor should call 
> exitExecutor (kill itself).
>  
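
The description above makes two points that a small sketch can illustrate: `UnsatisfiedLinkError` is treated as fatal, and the Executor could kill itself instead of staying half-initialized. The following is only a sketch under assumptions, not Spark code. It uses `scala.util.control.NonFatal` from the Scala standard library to classify the error; `PluginInitSketch`, `isFatal`, `initOrExit` and the `exit` callback are hypothetical names introduced for illustration, and the callback merely stands in for something like `exitExecutor`.

```scala
import scala.util.control.NonFatal

// Illustrative only; none of these names are Spark APIs.
object PluginInitSketch {

  /** True when scala.util.control.NonFatal refuses to match the throwable. */
  def isFatal(t: Throwable): Boolean = t match {
    case NonFatal(_) => false
    case _           => true
  }

  /**
   * Hypothetical fail-fast wrapper: runs `init` and, on any failure, reports
   * the reason to `exit` instead of leaving a half-initialized process alive.
   * Returns true when `init` completed.
   */
  def initOrExit(init: () => Unit)(exit: String => Unit): Boolean =
    try {
      init()
      true
    } catch {
      case t: Throwable =>
        val kind = if (isFatal(t)) "fatal" else "non-fatal"
        exit(s"Plugin init failed with $kind error: $t")
        false
    }

  def main(args: Array[String]): Unit = {
    println(isFatal(new UnsatisfiedLinkError("My Exception error")))
    println(isFatal(new IllegalStateException("recoverable")))
    // A real executor would terminate in the callback; the sketch only prints.
    initOrExit(() => throw new UnsatisfiedLinkError("My Exception error")) { reason =>
      println(reason)
    }
  }
}
```

`isFatal` returns `true` for `UnsatisfiedLinkError` because it is a `LinkageError`, which `NonFatal` does not match, and `false` for `IllegalStateException`. Passing the exit action as a callback keeps the sketch testable without terminating the JVM.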



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung

2022-09-07 Thread Mars (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601543#comment-17601543
 ] 

Mars commented on SPARK-40320:
--

[~Ngone51] 
> Shouldn't it bring up a new `receiveLoop()` to serve RPC messages?
Yes, my earlier analysis was wrong. I attached a remote debugger to the Executor 
and found that it does catch the fatal error in 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89].
It resubmits `receiveLoop()`, which then blocks at 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L69].
However, this Executor never initialized successfully, so it did not send 
LaunchedExecutor to the Driver (see 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L172]).

As a result, the Executor cannot launch tasks. Related PR: 
[https://github.com/apache/spark/pull/25964].

> Why doesn't SparkUncaughtExceptionHandler catch the fatal error?
See 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L284].
`plugins` is a private variable of the Executor, so the failure happens while the 
Executor is being initialized at the beginning.
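
To make the sequence described in this comment easier to follow, here is a minimal, self-contained sketch. It is not Spark code: `ReceiveLoopSketch`, `replay`, `inbox`, `launched` and `failingPluginInit` are hypothetical stand-ins. It assumes a single worker thread and an uncaught-exception handler that logs instead of killing the JVM. The first message simulates start-up, where plugin initialization throws before the step that would notify the driver. The loop is then resubmitted and keeps serving later messages, yet the "launched" step is never reached.

```scala
import java.util.concurrent.{CountDownLatch, Executors, LinkedBlockingQueue, ThreadFactory, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean
import scala.util.control.NonFatal

// Illustrative stand-in only; none of these names are Spark classes.
object ReceiveLoopSketch {

  /** Replays the scenario and returns (launched, laterMessageServed). */
  def replay(): (Boolean, Boolean) = {
    val inbox = new LinkedBlockingQueue[() => Unit]()
    val launched = new AtomicBoolean(false)
    val laterServed = new CountDownLatch(1)

    // One worker thread whose uncaught-exception handler logs and does not kill the JVM.
    val pool = Executors.newFixedThreadPool(1, new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r)
        t.setDaemon(true)
        t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
          def uncaughtException(th: Thread, e: Throwable): Unit =
            println(s"uncaught in ${th.getName}: $e")
        })
        t
      }
    })

    def receiveLoop(): Unit =
      try {
        while (true) {
          try {
            inbox.take().apply() // blocks while the inbox is empty
          } catch {
            case NonFatal(e) => println(s"ignored non-fatal error: $e")
          }
        }
      } catch {
        case _: InterruptedException => // pool is shutting down: leave the loop
        case t: Throwable =>
          // Fatal error: queue a replacement loop, then let this worker die.
          pool.execute(new Runnable { def run(): Unit = receiveLoop() })
          throw t
      }

    pool.execute(new Runnable { def run(): Unit = receiveLoop() })

    // Message 1: start-up. Plugin init throws before the "launched" step is reached.
    inbox.put(() => {
      failingPluginInit()
      launched.set(true) // stands in for sending LaunchedExecutor to the driver
    })
    // Message 2: any later message; it is served by the replacement loop.
    inbox.put(() => laterServed.countDown())

    val served = laterServed.await(5, TimeUnit.SECONDS)
    pool.shutdownNow()
    (launched.get(), served)
  }

  private def failingPluginInit(): Unit =
    throw new UnsatisfiedLinkError("My Exception error")

  def main(args: Array[String]): Unit =
    println(replay())
}
```

`replay()` returns `(false, true)`: the later message is served by the resubmitted loop, but the launch notification is never sent. That matches the symptom reported here, where the process looks alive but is never given a task.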




[jira] [Commented] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung

2022-09-06 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600777#comment-17600777
 ] 

wuyi commented on SPARK-40320:
--

> Actually the `CoarseGrainedExecutorBackend` JVM process is active but the 
> communication thread is no longer working (please see 
> `MessageLoop#receiveLoopRunnable`, `receiveLoop()` was broken, so executor 
> doesn't receive any message)

Hmm, shouldn't it bring up a new `receiveLoop()` to serve RPC messages, 
according to 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L82-L89]? 
Why doesn't it?

Besides, why doesn't SparkUncaughtExceptionHandler catch the fatal error?

cc [~tgraves] [~mridulm80] 




[jira] [Commented] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung

2022-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599764#comment-17599764
 ] 

Apache Spark commented on SPARK-40320:
--

User 'yabola' has created a pull request for this issue:
https://github.com/apache/spark/pull/37779




[jira] [Commented] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung

2022-09-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599763#comment-17599763
 ] 

Apache Spark commented on SPARK-40320:
--

User 'yabola' has created a pull request for this issue:
https://github.com/apache/spark/pull/37779
