[ https://issues.apache.org/jira/browse/SPARK-32969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
El R updated SPARK-32969:
-------------------------
    Affects Version/s:     (was: 3.0.1)

> Spark Submit process not exiting after session.stop()
> -----------------------------------------------------
>
>                 Key: SPARK-32969
>                 URL: https://issues.apache.org/jira/browse/SPARK-32969
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Submit
>    Affects Versions: 2.4.7
>            Reporter: El R
>            Priority: Critical
>
> Exactly 3 spark-submit processes are hanging from the first 3 jobs that were submitted to the standalone cluster using client mode. Example from the client:
> {code:java}
> root  1517  0.3  4.7  8412728 1532876 ?  Sl  18:49  0:38 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 --conf spark.master=spark://3c520b0c6d6e:7077 --conf spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e --conf spark.fileserver.port=46102 --conf packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf spark.replClassServer.port=46104 --conf spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true pyspark-shell
> root  1746  0.4  3.5  8152640 1132420 ?  Sl  18:59  0:36 [identical java command line to PID 1517]
> root  2239 65.3  7.8  9743456 2527236 ?  Sl  19:10 91:30 [identical java command line to PID 1517]
> {code}
> The corresponding jobs show as 'completed' in the Spark UI and, according to their logs, have closed their sessions and exited. No worker resources are consumed by these jobs anymore, and subsequent jobs are able to receive maximum executors and run as expected. However, these 3 processes keep consuming memory at a slowly growing rate, which eventually leads to an OOM when trying to allocate a new driver.
> The thread list of process 1517 above shows the following user threads (daemon threads omitted):
> {code:java}
> "Thread-4" #16 prio=5 os_prio=0 tid=0x00007f8fe4008000 nid=0x61c runnable [0x00007f9029227000]
>    java.lang.Thread.State: RUNNABLE
> 	at java.net.SocketInputStream.socketRead0(Native Method)
> 	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> 	at java.net.SocketInputStream.read(SocketInputStream.java:171)
> 	at java.net.SocketInputStream.read(SocketInputStream.java:141)
> 	at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
> 	at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
> 	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> 	- locked <0x00000000800f8a88> (a java.io.InputStreamReader)
> 	at java.io.InputStreamReader.read(InputStreamReader.java:184)
> 	at java.io.BufferedReader.fill(BufferedReader.java:161)
> 	at java.io.BufferedReader.readLine(BufferedReader.java:324)
> 	- locked <0x00000000800f8a88> (a java.io.InputStreamReader)
> 	at java.io.BufferedReader.readLine(BufferedReader.java:389)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:230)
> 	at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
> 	- None
>
> "Thread-3" #15 prio=5 os_prio=0 tid=0x00007f905dab7000 nid=0x61b runnable [0x00007f9029328000]
>    java.lang.Thread.State: RUNNABLE
> 	at java.net.PlainSocketImpl.socketAccept(Native Method)
> 	at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> 	at java.net.ServerSocket.implAccept(ServerSocket.java:560)
> 	at java.net.ServerSocket.accept(ServerSocket.java:528)
> 	at py4j.GatewayServer.run(GatewayServer.java:685)
> 	at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
> 	- None
>
> "pool-1-thread-1" #14 prio=5 os_prio=0 tid=0x00007f905daa5000 nid=0x61a waiting on condition [0x00007f902982c000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
> 	at sun.misc.Unsafe.park(Native Method)
> 	- parking to wait for <0x000000008011cda8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
> 	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
> 	- None
>
> "main" #1 prio=5 os_prio=0 tid=0x00007f905c016800 nid=0x604 runnable [0x00007f9062b96000]
>    java.lang.Thread.State: RUNNABLE
> 	at java.io.FileInputStream.readBytes(Native Method)
> 	at java.io.FileInputStream.read(FileInputStream.java:255)
> 	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> 	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> 	- locked <0x0000000080189dc8> (a java.io.BufferedInputStream)
> 	at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:87)
> 	at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> 	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
> 	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
> 	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
> 	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> 	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>    Locked ownable synchronizers:
> 	- None
>
> "VM Thread" os_prio=0 tid=0x00007f905c08c000 nid=0x60d runnable
> "GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f905c02b800 nid=0x605 runnable
> "GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f905c02d000 nid=0x606 runnable
> "GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f905c02f000 nid=0x607 runnable
> "GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f905c030800 nid=0x608 runnable
> "GC task thread#4 (ParallelGC)" os_prio=0 tid=0x00007f905c032800 nid=0x609 runnable
> "GC task thread#5 (ParallelGC)" os_prio=0 tid=0x00007f905c034000 nid=0x60a runnable
> "GC task thread#6 (ParallelGC)" os_prio=0 tid=0x00007f905c036000 nid=0x60b runnable
> "GC task thread#7 (ParallelGC)" os_prio=0 tid=0x00007f905c037800 nid=0x60c runnable
> "VM Periodic Task Thread" os_prio=0 tid=0x00007f905c0e0800 nid=0x616 waiting on condition{code}
> The main thread is blocking on a file input stream read coming from the {{PythonGatewayServer}}, and the remaining user threads appear to be blocked waiting on socket reads. Daemon threads running {{sun.nio.ch.SelectorImpl.select}} and hanging at {{sun.nio.ch.EPollArrayWrapper.epollWait}} keep being added to these processes, perhaps suggesting that the processes are reused for subsequent jobs but do not release at least some of their resources, which would explain the observed memory leak.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org