[ https://issues.apache.org/jira/browse/SPARK-33325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453536#comment-17453536 ]
wineternity edited comment on SPARK-33325 at 12/5/21, 7:20 AM:
---------------------------------------------------------------
Fixed in https://issues.apache.org/jira/browse/SPARK-36532

was (Author: yimo_yym): Fixed in https://issues.apache.org/jira/browse/SPARK-36532

> Spark executor pods are not shutting down when losing driver connection
> -----------------------------------------------------------------------
>
>                 Key: SPARK-33325
>                 URL: https://issues.apache.org/jira/browse/SPARK-33325
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.0.1
>            Reporter: Hadrien Kohl
>            Priority: Major
>
> In situations where the executors lose contact with the driver, the Java process does not die. I am looking into what on the Kubernetes cluster could prevent proper clean-up.
> The Spark driver is started in its own pod in client mode (a PySpark shell started by Jupyter). It works fine most of the time, but if the driver process crashes (OOM or a kill signal, for instance) the executor complains about the connection being reset by peer and then hangs.
> Here's the log from an executor pod that hangs:
> {code:java}
> 20/11/03 07:35:30 WARN TransportChannelHandler: Exception in connection from /10.17.0.152:37161
> java.io.IOException: Connection reset by peer
> 	at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> 	at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
> 	at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
> 	at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
> 	at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
> 	at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
> 	at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
> 	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
> 	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> 	at java.base/java.lang.Thread.run(Unknown Source)
> 20/11/03 07:35:30 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver 10.17.0.152:37161 disassociated! Shutting down.
> 20/11/03 07:35:31 INFO MemoryStore: MemoryStore cleared
> 20/11/03 07:35:31 INFO BlockManager: BlockManager stopped
> {code}
> When I start a shell in the pod I can see the processes are still running:
> {code:java}
> UID   PID  PPID  C      SZ    RSS PSR STIME TTY       TIME CMD
> 185   125     0  0    5045   3968   2 10:07 pts/0 00:00:00 /bin/bash
> 185   166   125  0    9019   3364   1 10:39 pts/0 00:00:00  \_ ps -AF --forest
> 185     1     0  0    1130    768   0 07:34 ?     00:00:00 /usr/bin/tini -s -- /opt/java/openjdk/
> 185    14     1  0 1935527 493976   3 07:34 ?     00:00:21 /opt/java/openjdk/bin/java -Dspark.dri
> {code}
> Here's the full command used to start the executor:
> {code:java}
> /opt/java/openjdk/bin/java -Dspark.driver.port=37161 -Xms4g -Xmx4g -cp :/opt/spark/jars/*: org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.17.0.152:37161 --executor-id 1 --cores 1 --app-id spark-application-1604388891044 --hostname 10.17.2.151
> {code}
>
> --
> This message was sent by Atlassian Jira
> (v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
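[Editor's note, not part of the original report: the `ps` listing above shows the symptom the reporter describes, an executor JVM that survives its driver. The following is a minimal, hypothetical shell sketch of that check; the process name and driver address come from the report, but the listing is simulated here rather than taken from a live pod.]

```shell
# Hypothetical sketch: detect an orphaned executor JVM from a ps listing.
# Simulated line mirroring the report's output (real usage would pipe
# `ps -AF --forest` instead of this variable).
ps_listing='185 14 1 0 1935527 493976 3 07:34 ? 00:00:21 /opt/java/openjdk/bin/java -Dspark.driver.port=37161 org.apache.spark.executor.CoarseGrainedExecutorBackend'

if printf '%s\n' "$ps_listing" | grep -q 'CoarseGrainedExecutorBackend'; then
  # In a live pod one would additionally confirm the driver address
  # (10.17.0.152:37161 in the report) is unreachable before concluding
  # the executor has been orphaned.
  echo "executor JVM still running"
fi
```

The point of the second check is that a matching process alone is normal; it is the combination of a running `CoarseGrainedExecutorBackend` and a dead driver endpoint that indicates the hang described in this issue.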