El R created SPARK-32969:
----------------------------

             Summary: Spark Submit process not exiting after session.stop()
                 Key: SPARK-32969
                 URL: https://issues.apache.org/jira/browse/SPARK-32969
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Submit
    Affects Versions: 3.0.1, 2.4.7
            Reporter: El R


Exactly 3 spark-submit processes are left hanging from the first 3 jobs that were 
submitted to the standalone cluster in client mode. Example from the client:
{code:java}
root 1517 0.3 4.7 8412728 1532876 ? Sl 18:49 0:38 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 --conf spark.master=spark://3c520b0c6d6e:7077 --conf spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e --conf spark.fileserver.port=46102 --conf packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf spark.replClassServer.port=46104 --conf spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true pyspark-shell
root 1746 0.4 3.5 8152640 1132420 ? Sl 18:59 0:36 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 --conf spark.master=spark://3c520b0c6d6e:7077 --conf spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e --conf spark.fileserver.port=46102 --conf packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf spark.replClassServer.port=46104 --conf spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true pyspark-shell
root 2239 65.3 7.8 9743456 2527236 ? Sl 19:10 91:30 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop-2.7.7/etc/hadoop/:/usr/local/hadoop-2.7.7/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/common/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/yarn/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.7/share/hadoop/mapreduce/*:/usr/local/hadoop/contrib/capacity-scheduler/*.jar -Xmx2g org.apache.spark.deploy.SparkSubmit --conf spark.driver.port=46101 --conf spark.master=spark://3c520b0c6d6e:7077 --conf spark.scheduler.allocation.file=/home/jovyan/work/spark_scheduler_allocation.xml --conf spark.app.name=REDACTED --conf spark.driver.bindAddress=3c520b0c6d6e --conf spark.fileserver.port=46102 --conf packages=org.apache.kudu:kudu-spark2_2.11:1.12.0 --conf spark.broadcast.port=46103 --conf spark.driver.host=3c520b0c6d6e --conf spark.replClassServer.port=46104 --conf spark.executorEnv.AF_ALERTS_STREAM_KEY=ALERTS_STREAM_LIST --conf spark.scheduler.mode=FAIR --conf spark.shuffle.service.enabled=True --conf spark.blockManager.port=46105 --conf spark.dynamicAllocation.enabled=true pyspark-shell
{code}
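For context, a minimal sketch of the client-side pattern that produces these processes. The master URL and configuration values are taken from the command lines above; the workload itself is illustrative, since the actual job code is redacted:
{code:python}
from pyspark.sql import SparkSession

# Illustrative client-mode session against the standalone master shown above.
spark = (SparkSession.builder
         .master("spark://3c520b0c6d6e:7077")
         .appName("REDACTED")
         .config("spark.scheduler.mode", "FAIR")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         .getOrCreate())

spark.range(1000).count()  # placeholder for the actual workload

# The job completes and the session is stopped...
spark.stop()
# ...but the backing org.apache.spark.deploy.SparkSubmit JVM keeps running.
{code}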
The corresponding jobs show as 'completed' in the Spark UI and, according to 
their logs, have closed their sessions and exited. These jobs no longer consume 
any worker resources, and subsequent jobs receive the maximum number of 
executors and run as expected. However, the 3 processes keep consuming memory 
at a slowly growing rate, which eventually leads to an OOM when trying to 
allocate a new driver.
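
A quick way to watch the leak from the client (a hedged helper, not part of the original report): list the leftover SparkSubmit JVMs and re-run periodically to see their resident set size grow:
{code:python}
import subprocess

# Print PID, resident set size (KiB) and elapsed time for every leftover
# SparkSubmit JVM on the client; re-running shows the RSS slowly growing.
ps = subprocess.run(["ps", "-eo", "pid,rss,etime,args"],
                    capture_output=True, text=True, check=True)
for line in ps.stdout.splitlines():
    if "org.apache.spark.deploy.SparkSubmit" in line:
        pid, rss, etime = line.split()[:3]
        print(pid, rss, etime)
{code}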


The thread dump of process 1517 above shows the following user threads 
(daemon threads omitted):
{code:java}
"Thread-4" #16 prio=5 os_prio=0 tid=0x00007f8fe4008000 nid=0x61c runnable 
[0x00007f9029227000] java.lang.Thread.State: RUNNABLE `at 
java.net.SocketInputStream.socketRead0(Native Method)` `at 
java.net.SocketInputStream.socketRead(`[`SocketInputStream.java:116`](https://SocketInputStream.java:116)`)`
 `at` 
[`java.net.SocketInputStream.read`](https://java.net.SocketInputStream.read)`(`[`SocketInputStream.java:171`](https://SocketInputStream.java:171)`)`
 `at` 
[`java.net.SocketInputStream.read`](https://java.net.SocketInputStream.read)`(`[`SocketInputStream.java:141`](https://SocketInputStream.java:141)`)`
 `at 
sun.nio.cs.StreamDecoder.readBytes(`[`StreamDecoder.java:284`](https://StreamDecoder.java:284)`)`
 `at 
sun.nio.cs.StreamDecoder.implRead(`[`StreamDecoder.java:326`](https://StreamDecoder.java:326)`)`
 `at` 
[`sun.nio.cs.StreamDecoder.read`](https://sun.nio.cs.StreamDecoder.read)`(`[`StreamDecoder.java:178`](https://StreamDecoder.java:178)`)`
 `- locked <0x00000000800f8a88> (a java.io.InputStreamReader)` `at` 
[`java.io.InputStreamReader.read`](https://java.io.InputStreamReader.read)`(`[`InputStreamReader.java:184`](https://InputStreamReader.java:184)`)`
 `at 
java.io.BufferedReader.fill(`[`BufferedReader.java:161`](https://BufferedReader.java:161)`)`
 `at 
java.io.BufferedReader.readLine(`[`BufferedReader.java:324`](https://BufferedReader.java:324)`)`
 `- locked <0x00000000800f8a88> (a java.io.InputStreamReader)` `at 
java.io.BufferedReader.readLine(`[`BufferedReader.java:389`](https://BufferedReader.java:389)`)`
 `at` 
[`py4j.GatewayConnection.run`](https://py4j.GatewayConnection.run)`(`[`GatewayConnection.java:230`](https://GatewayConnection.java:230)`)`
 `at` 
[`java.lang.Thread.run`](https://java.lang.Thread.run)`(`[`Thread.java:748`](https://Thread.java:748)`)`
 Locked ownable synchronizers: `- None` 

"Thread-3" #15 prio=5 os_prio=0 tid=0x00007f905dab7000 nid=0x61b runnable 
[0x00007f9029328000] java.lang.Thread.State: RUNNABLE `at 
java.net.PlainSocketImpl.socketAccept(Native Method)` `at 
java.net.AbstractPlainSocketImpl.accept(`[`AbstractPlainSocketImpl.java:409`](https://AbstractPlainSocketImpl.java:409)`)`
 `at 
java.net.ServerSocket.implAccept(`[`ServerSocket.java:560`](https://ServerSocket.java:560)`)`
 `at 
java.net.ServerSocket.accept(`[`ServerSocket.java:528`](https://ServerSocket.java:528)`)`
 `at` 
[`py4j.GatewayServer.run`](https://py4j.GatewayServer.run)`(`[`GatewayServer.java:685`](https://GatewayServer.java:685)`)`
 `at` 
[`java.lang.Thread.run`](https://java.lang.Thread.run)`(`[`Thread.java:748`](https://Thread.java:748)`)`
 Locked ownable synchronizers: `- None` 

"pool-1-thread-1" #14 prio=5 os_prio=0 tid=0x00007f905daa5000 nid=0x61a waiting 
on condition [0x00007f902982c000] java.lang.Thread.State: TIMED_WAITING 
(parking) `at sun.misc.Unsafe.park(Native Method)` `- parking to wait for 
<0x000000008011cda8> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)` `at 
java.util.concurrent.locks.LockSupport.parkNanos(`[`LockSupport.java:215`](https://LockSupport.java:215)`)`
 `at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(`[`AbstractQueuedSynchronizer.java:2078`](https://AbstractQueuedSynchronizer.java:2078)`)`
 `at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(`[`ScheduledThreadPoolExecutor.java:1093`](https://ScheduledThreadPoolExecutor.java:1093)`)`
 `at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(`[`ScheduledThreadPoolExecutor.java:809`](https://ScheduledThreadPoolExecutor.java:809)`)`
 `at 
java.util.concurrent.ThreadPoolExecutor.getTask(`[`ThreadPoolExecutor.java:1074`](https://ThreadPoolExecutor.java:1074)`)`
 `at 
java.util.concurrent.ThreadPoolExecutor.runWorker(`[`ThreadPoolExecutor.java:1134`](https://ThreadPoolExecutor.java:1134)`)`
 `at` 
[`java.util.concurrent.ThreadPoolExecutor$Worker.run`](https://java.util.concurrent.ThreadPoolExecutor$Worker.run)`(`[`ThreadPoolExecutor.java:624`](https://ThreadPoolExecutor.java:624)`)`
 `at` 
[`java.lang.Thread.run`](https://java.lang.Thread.run)`(`[`Thread.java:748`](https://Thread.java:748)`)`
 Locked ownable synchronizers: `- None` 

"main" #1 prio=5 os_prio=0 tid=0x00007f905c016800 nid=0x604 runnable 
[0x00007f9062b96000] java.lang.Thread.State: RUNNABLE `at 
java.io.FileInputStream.readBytes(Native Method)` `at` 
[`java.io.FileInputStream.read`](https://java.io.FileInputStream.read)`(`[`FileInputStream.java:255`](https://FileInputStream.java:255)`)`
 `at 
java.io.BufferedInputStream.fill(`[`BufferedInputStream.java:246`](https://BufferedInputStream.java:246)`)`
 `at` 
[`java.io.BufferedInputStream.read`](https://java.io.BufferedInputStream.read)`(`[`BufferedInputStream.java:265`](https://BufferedInputStream.java:265)`)`
 `- locked <0x0000000080189dc8> (a java.io.BufferedInputStream)` `at 
org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:87)`
 `at 
org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)`
 `at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)` `at 
sun.reflect.NativeMethodAccessorImpl.invoke(`[`NativeMethodAccessorImpl.java:62`](https://NativeMethodAccessorImpl.java:62)`)`
 `at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(`[`DelegatingMethodAccessorImpl.java:43`](https://DelegatingMethodAccessorImpl.java:43)`)`
 `at 
java.lang.reflect.Method.invoke(`[`Method.java:498`](https://Method.java:498)`)`
 `at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)` 
`at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)`
 `at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)` 
`at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)` `at 
org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)` `at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)` 
`at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)` `at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)` Locked ownable 
synchronizers: `- None` "VM Thread" os_prio=0 tid=0x00007f905c08c000 nid=0x60d 
runnable "GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f905c02b800 
nid=0x605 runnable "GC task thread#1 (ParallelGC)" os_prio=0 
tid=0x00007f905c02d000 nid=0x606 runnable "GC task thread#2 (ParallelGC)" 
os_prio=0 tid=0x00007f905c02f000 nid=0x607 runnable "GC task thread#3 
(ParallelGC)" os_prio=0 tid=0x00007f905c030800 nid=0x608 runnable "GC task 
thread#4 (ParallelGC)" os_prio=0 tid=0x00007f905c032800 nid=0x609 runnable "GC 
task thread#5 (ParallelGC)" os_prio=0 tid=0x00007f905c034000 nid=0x60a runnable 
"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x00007f905c036000 nid=0x60b 
runnable "GC task thread#7 (ParallelGC)" os_prio=0 tid=0x00007f905c037800 
nid=0x60c runnable "VM Periodic Task Thread" os_prio=0 tid=0x00007f905c0e0800 
nid=0x616 waiting on condition{code}
The main thread is blocked on a file input stream read inside 
{{PythonGatewayServer}}, and the remaining threads appear to be blocked waiting 
on socket reads. Daemon threads running 'sun.nio.ch.SelectorImpl.select' and 
hanging at 'sun.nio.ch.EPollArrayWrapper.epollWait' keep being added to these 
processes, perhaps suggesting that the processes are reused for subsequent 
jobs but do not release at least some of their resources, which would explain 
the observed memory leak.
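
Assuming it is the py4j gateway that keeps the JVM's non-daemon threads ("Thread-3"/"Thread-4" above) alive, a possible client-side workaround is to shut the gateway down explicitly after stopping the session. This is only a sketch: {{SparkContext._gateway}} is a private PySpark attribute, not a documented API, so treat it as an assumption rather than a confirmed fix:
{code:python}
from pyspark import SparkContext

spark.stop()  # 'spark' is the session from the job above

# Workaround sketch (assumption): SparkContext._gateway is PySpark's private
# handle to the py4j JavaGateway. Shutting it down closes the GatewayServer
# accept loop and its connections, which should let the SparkSubmit JVM exit.
gateway = SparkContext._gateway
if gateway is not None:
    gateway.shutdown()
{code}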



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
