Liran created SPARK-35625:
-----------------------------

             Summary: Spark on k8s zombie executors
                 Key: SPARK-35625
                 URL: https://issues.apache.org/jira/browse/SPARK-35625
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.0.1
            Reporter: Liran
         Attachments: image-2021-06-03-12-15-57-095.png

We are running a POC of a Spark on K8s setup for one of our apps. It scales 
up and down quite a lot, and after a while we started seeing many log entries 
like these:
{code:java}
Error trying to remove broadcast 8805 from block manager BlockManagerId(79, 
10.244.248.23, 46681, None)
java.io.IOException: Failed to send RPC RPC 6006709312311899870 to 
/10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
{code}
{code:java}
Error trying to remove RDD 32952 from block manager BlockManagerId(79, 
10.244.248.23, 46681, None)
java.io.IOException: Failed to send RPC RPC 7506603739599355778 to 
/10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
{code}
 

All of the errors/warnings relate to attempts to *remove* (shuffle/broadcast/RDD) 
files/blocks, which doesn't seem to be harmful at this point other than spamming 
our logs.
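
To see which executors the failed remove requests are aimed at, the driver log can be tallied per block manager. The following is a minimal sketch, assuming the message format of the two samples quoted above; the function name `tally_remove_errors` is hypothetical, and the `shuffle` alternative in the pattern is an assumption, since no shuffle message is quoted in this report.

```python
import re
from collections import Counter

# Matches the driver messages quoted in this report, for example:
#   Error trying to remove RDD 32952 from block manager
#   BlockManagerId(79, 10.244.248.23, 46681, None)
# The "shuffle" variant is assumed to follow the same wording.
REMOVE_ERROR = re.compile(
    r"Error trying to remove (?P<kind>broadcast|RDD|shuffle) (?P<block_id>\d+) "
    r"from block manager BlockManagerId\("
    r"(?P<executor_id>\w+),\s*(?P<host>[^,\s]+),\s*(?P<port>\d+)"
)


def tally_remove_errors(log_text):
    """Count failed remove requests per (executor id, host, block kind)."""
    # Log lines may be wrapped (as in this email), so collapse whitespace first.
    flat = " ".join(log_text.split())
    counts = Counter()
    for match in REMOVE_ERROR.finditer(flat):
        key = (
            match.group("executor_id"),
            match.group("host"),
            match.group("kind").lower(),
        )
        counts[key] += 1
    return counts
```

If most of the count lands on a handful of executor IDs whose pods no longer exist, that points at stale block manager registrations on the driver rather than at a network problem.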

 

The interesting part is that, according to kubectl, the executors are not 
alive (as expected), yet in the Spark UI they still show up as "active" with 
0 cores:

 

!image-2021-06-03-12-13-20-140.png!

 

!image-2021-06-03-12-00-04-471.png!

 

All the executors marked above are long dead, but for some reason the driver 
app still tries to send RPC requests to them.
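
The mismatch between the Spark UI and kubectl can be checked mechanically. The sketch below assumes the executor list is taken from Spark's monitoring REST API (`/api/v1/applications/[app-id]/executors`) as a list of dicts with `id`, `hostPort` and `isActive` fields, and that on K8s the host part of `hostPort` is the executor pod IP. The function name `find_zombie_executors` and the way the live pod IPs are collected are hypothetical.

```python
def find_zombie_executors(executors, live_pod_ips):
    """Return IDs of executors the driver reports as active but whose
    host IP does not belong to any live pod.

    executors    -- list of dicts shaped like the REST API executor summaries
                    (assumed fields: "id", "hostPort", "isActive")
    live_pod_ips -- set of pod IPs currently known to Kubernetes
    """
    zombies = []
    for executor in executors:
        # The driver appears in the same list; it is not an executor pod.
        if executor.get("id") == "driver":
            continue
        if not executor.get("isActive", False):
            continue
        # hostPort looks like "10.244.248.23:46681"; keep only the host part.
        host = executor.get("hostPort", "").rsplit(":", 1)[0]
        if host not in live_pod_ips:
            zombies.append(executor["id"])
    return zombies
```

The set of live pod IPs can be read, for example, from the IP column of `kubectl get pods -o wide`. Pod IPs may be reused by newer pods, so a match on IP alone can hide a zombie; an empty result is a hint, not proof.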

 

According to our event logs, one of the pods was created on May 21 at 20:11 and 
was killed 9 minutes later at 20:20, but we are still seeing new logs on Jun 3.

!image-2021-06-03-12-07-29-032.png!

 

 

Sample of one of the errors:
{code:java}
Error trying to remove RDD 33178 from block manager BlockManagerId(79, 10.244.248.23, 46681, None)
java.io.IOException: Failed to send RPC RPC 7684271332363250835 to /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
    at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:363)
    at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:340)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:998)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:866)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
    at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.StacklessClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
{code}


