[ https://issues.apache.org/jira/browse/SPARK-29965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157936#comment-17157936 ]
Jeroen Dries commented on SPARK-29965:
--------------------------------------

We seem to see the same (or a similar?) issue on Spark 2.4.4. We have a long-running application on a YARN (Hortonworks 3.1.4.0-315) cluster with dynamic allocation and preemption enabled. After a fresh restart the logs are fine, but a few days later our logs are being spammed with messages like this:

{code:java}
20/07/15 08:34:10 WARN BlockManagerMasterEndpoint: Error trying to remove broadcast 258143 from block manager BlockManagerId(10469, epod036.vgt.vito.be, 35676, None)
java.io.IOException: Failed to send RPC RPC 8615372230556941680 to /192.168.207.136:51296: java.nio.channels.ClosedChannelException
	at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:362)
	at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:339)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
	at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
	at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
{code}
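Once a stale entry like this survives in the master, the warning repeats on every cleanup, because broadcast removal fans an RPC out to every registered block manager. A minimal sketch of that fan-out pattern, with invented names (this is not Spark's actual code):

{code:scala}
import java.io.IOException

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BroadcastCleanupSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Stand-in for a registered block manager endpoint; `alive` models
  // whether the underlying channel to the executor is still open.
  final case class BlockManagerRef(id: String, alive: Boolean) {
    def removeBroadcast(broadcastId: Long): Future[Unit] =
      if (alive) Future.unit
      else Future.failed(new IOException("Failed to send RPC: ClosedChannelException"))
  }

  def main(args: Array[String]): Unit = {
    val registered = Seq(
      BlockManagerRef("BlockManagerId(1, epod001.vgt.vito.be, 41000, None)", alive = true),
      // Stale entry left behind by the re-registration race.
      BlockManagerRef("BlockManagerId(10469, epod036.vgt.vito.be, 35676, None)", alive = false))

    // The master asks *every* registered block manager to drop the broadcast,
    // so one dead-but-registered entry fails on every single cleanup.
    val cleanup = registered.map { bm =>
      bm.removeBroadcast(258143L).recover {
        case e => println(s"WARN Error trying to remove broadcast 258143 from block manager ${bm.id}: $e")
      }
    }
    Await.result(Future.sequence(cleanup), 10.seconds)
  }
}
{code}

Since broadcasts are cleaned up continuously as they go out of scope, a single stale registration would be enough to spam the driver log indefinitely, which matches the behavior above.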
> Race in executor shutdown handling can lead to executor never fully unregistering
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-29965
>                 URL: https://issues.apache.org/jira/browse/SPARK-29965
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Marcelo Masiero Vanzin
>            Priority: Major
>
> I ran into a situation that I had never noticed before, but I seem to be able to hit with just a few retries when using K8S with dynamic allocation.
> Basically, there's a race when killing an executor, where it may send a heartbeat to the driver right at the wrong time during shutdown, e.g.:
> {noformat}
> 19/11/19 21:14:05 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
> 19/11/19 21:14:05 INFO Executor: Told to re-register on heartbeat
> 19/11/19 21:14:05 INFO BlockManager: BlockManager BlockManagerId(10, 192.168.3.99, 39923, None) re-registering with master
> 19/11/19 21:14:05 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(10, 192.168.3.99, 39923, None)
> 19/11/19 21:14:05 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(10, 192.168.3.99, 39923, None)
> 19/11/19 21:14:06 INFO BlockManager: Reporting 0 blocks to the master.
> {noformat}
> On the driver side it will happily re-register the executor (the time difference is just due to the time zone in the log4j config):
> {noformat}
> 19/11/19 13:14:05 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
> 19/11/19 13:14:05 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(10, 192.168.3.99, 39923, None)
> 19/11/19 13:14:05 INFO BlockManagerMaster: Removed 10 successfully in removeExecutor
> 19/11/19 13:14:05 INFO DAGScheduler: Shuffle files lost for executor: 10 (epoch 18)
> {noformat}
> And a little later:
> {noformat}
> 19/11/19 13:14:05 DEBUG HeartbeatReceiver: Received heartbeat from unknown executor 10
> 19/11/19 13:14:05 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.3.99:39923 with 413.9 MiB RAM, BlockManagerId(10, 192.168.3.99, 39923, None)
> {noformat}
> This becomes a problem later, where you start to see periodic exceptions in the driver's logs:
> {noformat}
> 19/11/19 13:14:39 WARN BlockManagerMasterEndpoint: Error trying to remove broadcast 4 from block manager BlockManagerId(10, 192.168.3.99, 39923, None)
> java.io.IOException: Failed to send RPC RPC 4999007301825869809 to /10.65.55.240:14233: java.nio.channels.ClosedChannelException
> 	at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:362)
> 	at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:339)
> 	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
> 	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
> {noformat}
> That happens every time some code calls into the block manager to request something from all executors. Meaning that the dead executor re-registered, and was then never removed from the block manager.
> I found a few races in the code that can lead to this situation. I'll post a PR once I test it more.
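The race as described points at a standard mitigation: make "shutting down" visible to the heartbeat path, so that a "told to re-register" reply arriving mid-shutdown is dropped rather than honored. A rough sketch of that kind of guard, with invented names (this is not the actual Spark patch):

{code:scala}
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative sketch only; names do not correspond to Spark's classes.
object ShutdownRaceGuard {
  // Flipped as soon as the driver commands a shutdown, *before* the
  // block manager teardown begins.
  private val stopping = new AtomicBoolean(false)

  // Heartbeat thread: the driver answered "unknown executor, please
  // re-register". Honoring that during shutdown is what recreates the
  // stale registration on the master.
  def maybeReRegister(reRegister: () => Unit): Unit =
    if (!stopping.get()) reRegister()

  def onDriverCommandedShutdown(): Unit = {
    stopping.set(true)
    // ... stop heartbeats, tear down the BlockManager, exit ...
  }

  def main(args: Array[String]): Unit = {
    onDriverCommandedShutdown()
    // A heartbeat racing with shutdown is now ignored instead of
    // re-registering a block manager the master just removed.
    maybeReRegister(() => println("re-registering with master"))
  }
}
{code}

A driver-side variant of the same idea would have the master refuse to (re-)register a block manager for an executor the scheduler no longer considers alive.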