[ https://issues.apache.org/jira/browse/SPARK-29965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157936#comment-17157936 ]
Jeroen Dries commented on SPARK-29965:
--------------------------------------

We seem to see the same (or a similar?) issue on Spark 2.4.4. We have a long-running application on a YARN (Hortonworks 3.1.4.0-315) cluster with dynamic allocation and preemption enabled. After a fresh restart the logs are fine, but a few days later our logs are being spammed with messages like this:

{code:java}
20/07/15 08:34:10 WARN BlockManagerMasterEndpoint: Error trying to remove broadcast 258143 from block manager BlockManagerId(10469, epod036.vgt.vito.be, 35676, None)
java.io.IOException: Failed to send RPC RPC 8615372230556941680 to /192.168.207.136:51296: java.nio.channels.ClosedChannelException
	at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:362)
	at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:339)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
	at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
	at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
{code}
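Once a stale entry like this survives in the master, the warning repeats on every cleanup, because broadcast removal fans an RPC out to every registered block manager. A minimal sketch of that fan-out pattern, with invented names (this is not Spark's actual code):

{code:scala}
import java.io.IOException

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BroadcastCleanupSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Stand-in for a registered block manager endpoint; `alive` models
  // whether the underlying channel to the executor is still open.
  final case class BlockManagerRef(id: String, alive: Boolean) {
    def removeBroadcast(broadcastId: Long): Future[Unit] =
      if (alive) Future.unit
      else Future.failed(new IOException("Failed to send RPC: ClosedChannelException"))
  }

  def main(args: Array[String]): Unit = {
    val registered = Seq(
      BlockManagerRef("BlockManagerId(1, epod001.vgt.vito.be, 41000, None)", alive = true),
      // Stale entry left behind by the re-registration race.
      BlockManagerRef("BlockManagerId(10469, epod036.vgt.vito.be, 35676, None)", alive = false))

    // The master asks *every* registered block manager to drop the broadcast,
    // so one dead-but-registered entry fails on every single cleanup.
    val cleanup = registered.map { bm =>
      bm.removeBroadcast(258143L).recover {
        case e => println(s"WARN Error trying to remove broadcast 258143 from block manager ${bm.id}: $e")
      }
    }
    Await.result(Future.sequence(cleanup), 10.seconds)
  }
}
{code}

Since broadcasts are cleaned up continuously as they go out of scope, a single stale registration would be enough to spam the driver log indefinitely, which matches the behavior above.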
> Race in executor shutdown handling can lead to executor never fully unregistering
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-29965
>                 URL: https://issues.apache.org/jira/browse/SPARK-29965
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Marcelo Masiero Vanzin
>            Priority: Major
>
> I ran into a situation that I had never noticed before, but I seem to be able to hit with just a few retries when using K8S with dynamic allocation.
> Basically, there's a race when killing an executor, where it may send a heartbeat to the driver right at the wrong time during shutdown, e.g.:
> {noformat}
> 19/11/19 21:14:05 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
> 19/11/19 21:14:05 INFO Executor: Told to re-register on heartbeat
> 19/11/19 21:14:05 INFO BlockManager: BlockManager BlockManagerId(10, 192.168.3.99, 39923, None) re-registering with master
> 19/11/19 21:14:05 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(10, 192.168.3.99, 39923, None)
> 19/11/19 21:14:05 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(10, 192.168.3.99, 39923, None)
> 19/11/19 21:14:06 INFO BlockManager: Reporting 0 blocks to the master.
> {noformat}
> On the driver side it will happily re-register the executor (the time difference is just due to the time zone in the log4j config):
> {noformat}
> 19/11/19 13:14:05 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
> 19/11/19 13:14:05 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(10, 192.168.3.99, 39923, None)
> 19/11/19 13:14:05 INFO BlockManagerMaster: Removed 10 successfully in removeExecutor
> 19/11/19 13:14:05 INFO DAGScheduler: Shuffle files lost for executor: 10 (epoch 18)
> {noformat}
> And a little later:
> {noformat}
> 19/11/19 13:14:05 DEBUG HeartbeatReceiver: Received heartbeat from unknown executor 10
> 19/11/19 13:14:05 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.3.99:39923 with 413.9 MiB RAM, BlockManagerId(10, 192.168.3.99, 39923, None)
> {noformat}
> This becomes a problem later, where you start to see periodic exceptions in the driver's logs:
> {noformat}
> 19/11/19 13:14:39 WARN BlockManagerMasterEndpoint: Error trying to remove broadcast 4 from block manager BlockManagerId(10, 192.168.3.99, 39923, None)
> java.io.IOException: Failed to send RPC RPC 4999007301825869809 to /10.65.55.240:14233: java.nio.channels.ClosedChannelException
> 	at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:362)
> 	at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:339)
> 	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
> 	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
> {noformat}
> That happens every time some code calls into the block manager to request something from all executors. Meaning that the dead executor re-registered, and was then never removed from the block manager.
> I found a few races in the code that can lead to this situation. I'll post a PR once I test it more.
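The race as described points at a standard mitigation: make "shutting down" visible to the heartbeat path, so that a "told to re-register" reply arriving mid-shutdown is dropped rather than honored. A rough sketch of that kind of guard, with invented names (this is not the actual Spark patch):

{code:scala}
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative sketch only; names do not correspond to Spark's classes.
object ShutdownRaceGuard {
  // Flipped as soon as the driver commands a shutdown, *before* the
  // block manager teardown begins.
  private val stopping = new AtomicBoolean(false)

  // Heartbeat thread: the driver answered "unknown executor, please
  // re-register". Honoring that during shutdown is what recreates the
  // stale registration on the master.
  def maybeReRegister(reRegister: () => Unit): Unit =
    if (!stopping.get()) reRegister()

  def onDriverCommandedShutdown(): Unit = {
    stopping.set(true)
    // ... stop heartbeats, tear down the BlockManager, exit ...
  }

  def main(args: Array[String]): Unit = {
    onDriverCommandedShutdown()
    // A heartbeat racing with shutdown is now ignored instead of
    // re-registering a block manager the master just removed.
    maybeReRegister(() => println("re-registering with master"))
  }
}
{code}

A driver-side variant of the same idea would have the master refuse to (re-)register a block manager for an executor the scheduler no longer considers alive.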