[ https://issues.apache.org/jira/browse/SPARK-35625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liran updated SPARK-35625:
--------------------------
    Description: 
We are running a POC of a Spark on K8s setup for one of our apps. It scales 
up and down quite a lot, and we started noticing that after a while we get 
many logs like these:
{code:java}
Error trying to remove broadcast 8805 from block manager BlockManagerId(79, 
10.244.248.23, 46681, None)
java.io.IOException: Failed to send RPC RPC 6006709312311899870 to 
/10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
{code}
{code:java}
Error trying to remove RDD 32952 from block manager BlockManagerId(79, 
10.244.248.23, 46681, None) java.io.IOException: Failed to send RPC RPC 
7506603739599355778 to /10.244.248.23:54004: 
io.netty.channel.StacklessClosedChannelException
{code}
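
For context, a minimal sketch of the kind of configuration this POC runs with 
(dynamic allocation with shuffle tracking on Kubernetes). The master URL, 
container image and allocation bounds below are illustrative placeholders, not 
our actual settings:
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative placeholders only -- the master URL, image name and allocation
// bounds are NOT the real values, just the shape of the setup.
val conf = new SparkConf()
  .setAppName("k8s-poc")
  .set("spark.master", "k8s://https://<api-server>:6443")
  .set("spark.kubernetes.container.image", "<spark-image>")
  .set("spark.dynamicAllocation.enabled", "true")
  // no external shuffle service on K8s, so shuffle tracking is used instead
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "100")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}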
 

All of the errors/warnings are related to trying to *remove* 
(shuffle/broadcast/RDD) files/blocks, which does not seem to be harmful at 
this point beyond spamming our logs.
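
These remove requests are issued by the driver (its ContextCleaner) when RDDs 
or broadcasts are unpersisted, destroyed, or garbage collected. A minimal 
sketch of the kind of user-side calls that trigger them, purely for 
illustration (not our actual job):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cleanup-sketch").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000).cache()
rdd.count()                       // materialize cached blocks on executors

val bc = sc.broadcast(Map("key" -> "value"))

// Unpersisting/destroying (or simply letting the references be GC'd) makes
// the driver-side ContextCleaner send RemoveRdd / RemoveBroadcast requests
// to every block manager it still has registered -- which is where the
// "Error trying to remove ..." messages above come from when the target
// executor no longer exists.
rdd.unpersist(blocking = false)
bc.destroy()
{code}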

 

The interesting part is that when looking in kubectl, the executors do not 
appear to be alive (as expected); in the Spark UI, on the other hand, they 
still show up as "active" with 0 cores:

 
!image-2021-06-03-12-15-57-095.png!
 
!image-2021-06-03-12-16-03-621.png!
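
To cross-check the driver's view from the application side, the public status 
tracker API can list the executors the driver still considers registered; a 
small sketch (the printed format is made up for illustration):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// List the executors the driver still considers registered; compared with
// `kubectl get pods -l spark-role=executor`, the extra entries here are the
// "zombie" executors the remove RPCs keep being sent to.
spark.sparkContext.statusTracker.getExecutorInfos.foreach { info =>
  println(s"${info.host()}:${info.port()} " +
    s"cachedBytes=${info.cacheSize()} runningTasks=${info.numRunningTasks()}")
}
{code}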
 

All the executors marked above are long dead, but for some reason the driver 
app still tries to send RPC requests to them.

 

According to our event logs, one of the pods was created on May 21 at 20:11 
and was killed 9 minutes later at 20:20, but we are still seeing new log 
entries for it on Jun 3.
!image-2021-06-03-12-16-12-573.png!
 

 

Sample of one of the errors:
{code:java}
Error trying to remove RDD 33178 from block manager BlockManagerId(79, 10.244.248.23, 46681, None)
java.io.IOException: Failed to send RPC RPC 7684271332363250835 to /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
    at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:363)
    at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:340)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:998)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:866)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
    at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.StacklessClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
{code}



> Spark on k8s zombie executors
> -----------------------------
>
>                 Key: SPARK-35625
>                 URL: https://issues.apache.org/jira/browse/SPARK-35625
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.0.1
>            Reporter: Liran
>            Priority: Major
>         Attachments: image-2021-06-03-12-15-57-095.png, 
> image-2021-06-03-12-16-03-621.png, image-2021-06-03-12-16-12-573.png
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
