Hadrien Kohl created SPARK-33325:
------------------------------------

             Summary: Spark executor pods are not shutting down when losing driver connection
                 Key: SPARK-33325
                 URL: https://issues.apache.org/jira/browse/SPARK-33325
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.0.1
            Reporter: Hadrien Kohl
In situations where the executors lose contact with the driver, the Java process does not die. I am looking at what on the Kubernetes cluster could prevent proper clean-up.

The Spark driver is started in its own pod in client mode (pyspark shell started by Jupyter). It works fine most of the time, but if the driver process crashes (an OOM or a kill signal, for instance), the executor complains about a connection reset by peer and then hangs.

Here's the log from an executor pod that hangs:

{code:java}
20/11/03 07:35:30 WARN TransportChannelHandler: Exception in connection from /10.17.0.152:37161
java.io.IOException: Connection reset by peer
    at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
    at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
    at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
    at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
    at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
    at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Unknown Source)
20/11/03 07:35:30 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver 10.17.0.152:37161 disassociated! Shutting down.
20/11/03 07:35:31 INFO MemoryStore: MemoryStore cleared
20/11/03 07:35:31 INFO BlockManager: BlockManager stopped
{code}

When I start a shell in the pod, I can see the processes are still running:

{code:java}
UID        PID  PPID  C      SZ    RSS PSR STIME TTY          TIME CMD
185        125     0  0    5045   3968   2 10:07 pts/0    00:00:00 /bin/bash
185        166   125  0    9019   3364   1 10:39 pts/0    00:00:00  \_ ps -AF --forest
185          1     0  0    1130    768   0 07:34 ?        00:00:00 /usr/bin/tini -s -- /opt/java/openjdk/
185         14     1  0 1935527 493976   3 07:34 ?        00:00:21 /opt/java/openjdk/bin/java -Dspark.dri
{code}

Here's the full command used to start the executor:

{code:java}
/opt/java/openjdk/bin/java -Dspark.driver.port=37161 -Xms4g -Xmx4g -cp :/opt/spark/jars/*: org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.17.0.152:37161 --executor-id 1 --cores 1 --app-id spark-application-1604388891044 --hostname 10.17.2.151
{code}
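To make the scenario concrete, here is a minimal sketch of the kind of client-mode session described above (a pyspark driver running in its own pod, with executors scheduled on Kubernetes). This is only an illustration, not the exact setup from the affected cluster: the API server URL, container image, namespace and app name are assumed placeholder values, while the driver host/port and executor memory are taken from the log and executor command above.

{code:python}
# Hypothetical client-mode setup on Kubernetes; placeholder values are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")            # assumed API server URL
    .appName("executor-shutdown-repro")                            # assumed app name
    .config("spark.kubernetes.container.image", "spark-py:3.0.1")  # assumed image
    .config("spark.kubernetes.namespace", "default")               # assumed namespace
    .config("spark.executor.instances", "1")
    .config("spark.executor.memory", "4g")         # matches -Xms4g -Xmx4g in the command above
    .config("spark.driver.host", "10.17.0.152")    # driver pod IP, as seen in the log above
    .config("spark.driver.port", "37161")          # fixed driver port, as seen in the log above
    .getOrCreate()
)

# Killing the driver process at this point (an OOM kill or `kill -9` on the
# driver pod) is the scenario described above: the executor logs
# "Connection reset by peer" and "disassociated! Shutting down.", but its JVM
# keeps running and the executor pod is never cleaned up.
{code}

The expected behaviour would be that the executor JVM actually exits after the "Shutting down" message, so that Kubernetes can reap the pod.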