[ https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shane knapp reopened SPARK-11655:
---------------------------------

this is still happening, but nowhere near as much as before.

the spark 1.6 SBT hadoop 2.0 build just timed out on jenkins, and left a 
hanging process that's eating up CPU on amp-jenkins-worker-02:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console

it times out, as i've noticed on other builds, while running the 
HiveThriftBinaryServerSuite tests.

the PID on that worker is 201960...  i'll leave it alone and not kill it for 
now.

> SparkLauncherBackendSuite leaks child processes
> -----------------------------------------------
>
>                 Key: SPARK-11655
>                 URL: https://issues.apache.org/jira/browse/SPARK-11655
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.6.0
>            Reporter: Josh Rosen
>            Assignee: Marcelo Vanzin
>            Priority: Blocker
>             Fix For: 1.6.0
>
>         Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combating an orphaned-process issue on AMPLab Jenkins since 
> October, and I was finally able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
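> One way to recover a full command line like the one above on Linux is to read 
> {{/proc/<pid>/cmdline}} directly. The sketch below is an assumption about the 
> technique, not necessarily how these commands were actually captured (the 
> description below refers to {{jps}} output):
> {code}
> import java.nio.file.{Files, Paths}
>
> // /proc/<pid>/cmdline holds the process's argv, NUL-separated. Note that
> // older kernels cap how much of it is exposed, which may be one of the OS
> // limits alluded to above.
> def fullCommandLine(pid: Long): Seq[String] = {
>   val bytes = Files.readAllBytes(Paths.get(s"/proc/$pid/cmdline"))
>   new String(bytes, "UTF-8").split('\u0000').toSeq.filter(_.nonEmpty)
> }
>
> // e.g. fullCommandLine(201960L).mkString(" ") for a suspect java process
> {code}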
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
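> For reference, this is roughly the shape of that launch path, reconstructed 
> from the description here rather than copied from the suite: the master 
> string and the extra-classpath conf come from the captured command line 
> above, while the app resource and main class are hypothetical placeholders.
> {code}
> import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}
>
> // Start a spark-submit child through the launcher library, then kill it.
> val handle: SparkAppHandle = new SparkLauncher()
>   .setMaster("local-cluster[1,1,1024]")
>   .setConf(SparkLauncher.DRIVER_EXTRA_CLASSPATH, sys.props("java.class.path"))
>   .setAppResource("/path/to/test-app.jar")   // hypothetical
>   .setMainClass("com.example.TestApp")       // hypothetical
>   .startApplication()
>
> // On teardown the suite calls handle.kill(); see the failure mode below.
> handle.kill()
> {code}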
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> an orphaned SparkSubmit process! I think what's happening is that the 
> launcher's {{handle.kill()}} call destroys the bash {{spark-submit}} 
> subprocess, but the JVM that bash launched is left behind and leaks.
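> A minimal, self-contained sketch of that failure mode (nothing Spark-specific 
> here; bash stands in for the {{spark-submit}} wrapper and {{sleep}} for the 
> SparkSubmit JVM):
> {code}
> object OrphanDemo {
>   def main(args: Array[String]): Unit = {
>     // Two statements in the -c string keep bash alive as an intermediate
>     // parent instead of exec-replacing itself with the child.
>     val shell = new ProcessBuilder("bash", "-c", "sleep 600 & wait").start()
>     Thread.sleep(1000)
>     // destroy() signals only bash; the backgrounded child is reparented to
>     // init and keeps running, which is exactly the orphaned-JVM pattern.
>     shell.destroy()
>   }
> }
> {code}
> After running this, the {{sleep}} child shows up with PPID 1, mirroring the 
> orphaned SparkSubmit JVMs on the build workers.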
> I think we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect the child to a 
> socket or stream so that it can detect its parent's death and clean up after 
> itself.
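> As a sketch of what that could look like on the child side (an illustration 
> of the general pattern, not necessarily the fix that was eventually made): 
> the child blocks on a stream whose write end the parent holds open for its 
> whole lifetime, and shuts itself down when that stream reaches EOF.
> {code}
> import java.io.InputStream
>
> // Hypothetical child-side watchdog: block on a stream tied to the parent's
> // lifetime and exit once it closes (i.e. once the parent has died).
> object ParentDeathWatcher {
>   def watch(in: InputStream): Unit = {
>     val t = new Thread(new Runnable {
>       override def run(): Unit = {
>         while (in.read() != -1) { /* ignore any bytes the parent sends */ }
>         sys.exit(1)
>       }
>     })
>     t.setDaemon(true)
>     t.start()
>   }
> }
>
> // In the child JVM, e.g.: ParentDeathWatcher.watch(System.in)
> {code}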
> /cc [~shaneknapp] and [~vanzin].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
