[jira] [Updated] (SPARK-40320) When the Executor plugin fails to initialize, the Executor shows active but does not accept tasks forever, just like being hung

2022-10-26 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-40320:

Fix Version/s: 3.4.0

> When the Executor plugin fails to initialize, the Executor shows active but 
> does not accept tasks forever, just like being hung
> ---
>
> Key: SPARK-40320
> URL: https://issues.apache.org/jira/browse/SPARK-40320
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Mars
>Priority: Major
> Fix For: 3.4.0
>
>
> *Reproduce steps:*
> Set `spark.plugins=ErrorSparkPlugin`.
> The `ErrorSparkPlugin` and `ErrorExecutorPlugin` classes are shown below 
> (abbreviated for clarity):
> {code:java}
> class ErrorSparkPlugin extends SparkPlugin {
>   override def driverPlugin(): DriverPlugin = new ErrorDriverPlugin()
>
>   override def executorPlugin(): ExecutorPlugin = new ErrorExecutorPlugin()
> }{code}
> {code:java}
> class ErrorExecutorPlugin extends ExecutorPlugin {
>   private val checkingInterval: Long = 1
>
>   // Always throws during init to simulate a fatal plugin failure.
>   override def init(_ctx: PluginContext, extraConf: util.Map[String, String]): Unit = {
>     if (checkingInterval == 1) {
>       throw new UnsatisfiedLinkError("My Exception error")
>     }
>   }
> }{code}
> The Executor shows as active in the Spark UI, but it is actually broken and 
> never receives any task.
> *Root Cause:*
> Checking the code, `org.apache.spark.rpc.netty.Inbox#safelyCall` rethrows 
> fatal errors (`UnsatisfiedLinkError` is a fatal error) via its 
> `dealWithFatalError` handler. The `CoarseGrainedExecutorBackend` JVM process 
> stays alive, but the communication thread is no longer running: in 
> `MessageLoop#receiveLoopRunnable`, `receiveLoop()` has exited, so the 
> executor no longer receives any messages.
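The failure mode hinges on `UnsatisfiedLinkError` being a `java.lang.Error`, not an `Exception`. A minimal, self-contained Java sketch (not Spark code; class and method names are hypothetical) shows how a message loop that only guards against `Exception` dies permanently on the first fatal error:

```java
// Illustration only: an Error such as UnsatisfiedLinkError is not an
// Exception, so a loop that only catches Exception stops on the first
// fatal error and never processes another message -- analogous to
// receiveLoop() breaking in the executor's MessageLoop.
public class FatalErrorDemo {
    static int processed = 0;

    static void handleMessage(int i) {
        if (i == 2) {
            throw new UnsatisfiedLinkError("simulated native link failure");
        }
        processed++;
    }

    public static void main(String[] args) {
        boolean loopDied = false;
        for (int i = 0; i < 5; i++) {
            try {
                handleMessage(i);
            } catch (Exception e) {
                // Never reached for UnsatisfiedLinkError: it is an Error,
                // not an Exception, so this handler does not apply.
            } catch (Error e) {
                loopDied = true; // the "receive loop" stops permanently
                break;
            }
        }
        System.out.println("processed=" + processed + ", loopDied=" + loopDied);
        // prints: processed=2, loopDied=true
    }
}
```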
> Some ideas:
> It is very hard to know what happened here without reading the code: the 
> Executor is active but cannot do anything, and users are left wondering 
> whether the driver or the Executor is at fault. At minimum, the Executor 
> status should not be shown as active here, or the Executor could call 
> exitExecutor (kill itself).
>  
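One way to realize the second idea above (a hypothetical sketch, not the actual SPARK-40320 patch) is to wrap plugin initialization so that any `Throwable`, including fatal errors, terminates the process instead of leaving it half-alive:

```java
import java.util.function.IntConsumer;

// Hypothetical guard: treat any Throwable raised during plugin
// initialization as fatal to the executor process, so it never lingers
// in an "active but dead" state.
public class PluginInitGuard {
    // Returns true if init succeeded; otherwise reports the Throwable and
    // invokes the supplied exit action (standing in for Spark's exitExecutor).
    static boolean initOrExit(Runnable init, IntConsumer exit) {
        try {
            init.run();
            return true;
        } catch (Throwable t) { // catches Error as well as Exception
            System.err.println("Plugin init failed fatally: " + t);
            exit.accept(1);
            return false;
        }
    }

    public static void main(String[] args) {
        final int[] exitCode = {-1}; // capture instead of System.exit for the demo
        boolean ok = initOrExit(
            () -> { throw new UnsatisfiedLinkError("simulated"); },
            code -> exitCode[0] = code);
        System.out.println("ok=" + ok + ", exitCode=" + exitCode[0]);
        // prints: ok=false, exitCode=1
    }
}
```

In a real executor the exit action would be the process-terminating path, so the scheduler sees the executor as lost rather than idle.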



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




