Hadrien Kohl created SPARK-33325:
------------------------------------

             Summary: Spark executor pods are not shutting down when losing driver connection
                 Key: SPARK-33325
                 URL: https://issues.apache.org/jira/browse/SPARK-33325
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.0.1
            Reporter: Hadrien Kohl


In situations where the executors lose contact with the driver, the Java process does not die. I am looking at what on the Kubernetes cluster could prevent proper clean-up.

The Spark driver is started in its own pod in client mode (a pyspark shell started by Jupyter). It works fine most of the time, but if the driver process crashes (OOM or a kill signal, for instance), the executor complains about the connection being reset by peer and then hangs.
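
For context, here is a rough sketch of how a client-mode driver like this might be configured. This is illustrative Java rather than the actual pyspark session; the master URL, container image and executor count are assumptions, and only the driver IP (10.17.0.152) and port (37161) below come from the logs further down.
{code:java}
// Hedged sketch, not the reporter's actual setup: a client-mode driver on Kubernetes.
import org.apache.spark.sql.SparkSession;

public class ClientModeDriverSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jupyter-client-mode")                          // assumed name
                .master("k8s://https://kubernetes.default.svc")          // assumed in-cluster master URL
                .config("spark.submit.deployMode", "client")
                .config("spark.driver.host", "10.17.0.152")              // driver pod IP seen in the executor log
                .config("spark.driver.port", "37161")                    // matches -Dspark.driver.port on the executor
                .config("spark.executor.instances", "1")                 // assumed
                .config("spark.kubernetes.container.image", "spark:3.0.1") // assumed image
                .getOrCreate();

        // Interactive work happens here; if this driver JVM dies abruptly (OOM, SIGKILL),
        // the executor pods are the ones reported to hang.
        spark.stop();
    }
}
{code}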

Here's the log from an executor pod that hangs:

 
{code:java}
20/11/03 07:35:30 WARN TransportChannelHandler: Exception in connection from /10.17.0.152:37161
java.io.IOException: Connection reset by peer
        at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
        at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
        at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
        at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
        at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
        at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
20/11/03 07:35:30 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver 10.17.0.152:37161 disassociated! Shutting down.
20/11/03 07:35:31 INFO MemoryStore: MemoryStore cleared
20/11/03 07:35:31 INFO BlockManager: BlockManager stopped
20/11/03 07:35:31 INFO MemoryStore: MemoryStore cleared
20/11/03 07:35:31 INFO BlockManager: BlockManager stopped

{code}
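The "Executor self-exiting due to : Driver ... disassociated! Shutting down." line from CoarseGrainedExecutorBackend is normally followed by the backend terminating its own JVM, which is what lets the pod exit. As a rough sketch of the expected behaviour (illustrative Java only; the real code is Scala inside Spark, and the names below are made up):
{code:java}
// Hedged sketch of what the executor backend is expected to do when the driver
// RPC connection drops. Not the actual Spark source; class and method names are illustrative.
public class ExecutorSelfExitSketch {

    // Stand-in for the disconnect callback on the executor side.
    void onDriverDisconnected(String driverAddress) {
        exitExecutor(1, "Driver " + driverAddress + " disassociated! Shutting down.");
    }

    // Stand-in for the self-exit helper: log the reason, then terminate the JVM so the
    // Kubernetes pod can finish. In this bug the log line is emitted and the
    // MemoryStore/BlockManager are stopped, but the JVM never actually exits.
    void exitExecutor(int code, String reason) {
        System.err.println("Executor self-exiting due to : " + reason);
        System.exit(code);
    }
}
{code}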
When I start a shell in the pod I can see the processes are still running:

{code:java}
UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
185          125       0  0  5045  3968   2 10:07 pts/0    00:00:00 /bin/bash
185          166     125  0  9019  3364   1 10:39 pts/0    00:00:00  \_ ps -AF --forest
185            1       0  0  1130   768   0 07:34 ?        00:00:00 /usr/bin/tini -s -- /opt/java/openjdk/
185           14       1  0 1935527 493976 3 07:34 ?       00:00:21 /opt/java/openjdk/bin/java -Dspark.dri
{code}
Here's the full command used to start the executor: 
{code:java}
/opt/java/openjdk/bin/java -Dspark.driver.port=37161 -Xms4g -Xmx4g -cp :/opt/spark/jars/*: org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.17.0.152:37161 --executor-id 1 --cores 1 --app-id spark-application-1604388891044 --hostname 10.17.2.151
{code}