[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626946#comment-14626946 ]
shane knapp commented on SPARK-8571:
------------------------------------

alright, more updates:

this is still happening, though w/much less frequency.

i discovered a hanging process on amp-jenkins-worker-01, which was the hadoop 2.3 matrix build spawned by Spark-Master-SBT. this particular build timed out after three hours, and automatically aborted even though it was still running:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2941/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/

i looked at the jenkins spec for builds being aborted, and didn't get very far:
https://wiki.jenkins-ci.org/display/JENKINS/Aborting+a+build

TL;DR: it uses java.lang.UNIXProcess.destroyProcess, which sends a SIGTERM to the builds. somehow this isn't actually killing everything.

one possible solution is to up the timeout by another ~30 minutes, but i don't think that'll necessarily fix the problem.

[~joshrosen] thoughts?

ps- that hanging process is still running on amp-jenkins-worker-01: PID 120943

> spark streaming hanging processes upon build exit
> -------------------------------------------------
>
> Key: SPARK-8571
> URL: https://issues.apache.org/jira/browse/SPARK-8571
> Project: Spark
> Issue Type: Bug
> Components: Build, Streaming
> Environment: centos 6.6 amplab build system
> Reporter: shane knapp
> Assignee: shane knapp
> Priority: Minor
> Labels: build, test
>
> over the past 3 months i've been noticing that there are occasionally hanging
> processes on our build system workers after various spark builds have
> finished. these are all spark streaming processes.
> today i noticed a 3+ hour spark build that was timed out after 200 minutes
> (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/),
> and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on
> amp-jenkins-worker-02. after the timeout, it left the following process (and
> all of its children) hanging.
> the process' CLI command was:
> {quote}
> [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714
> jenkins 1714 733 2.7 21342148 3642740 ? Sl 07:52 1713:41 java
> -Dderby.system.durability=test -Djava.awt.headless=true
> -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp
> -Dspark.driver.allowMultipleContexts=true
> -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos
> -Dspark.testing=1 -Dspark.ui.enabled=false
> -Dspark.ui.showConsoleProgress=false
> -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming
> -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m
> org.scalatest.tools.Runner -R
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes
> -o -f
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt
> -u
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/.
> {quote}
> stracing that process doesn't give us much:
> {quote}
> [root@amp-jenkins-worker-02 ~]# strace -p 1714
> Process 1714 attached - interrupt to quit
> futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL
> {quote}
> stracing its children gives us a *little* bit more...
> some loop like this:
> {quote}
> <snip>
> futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
> futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950,
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273},
> ffffffff) = -1 ETIMEDOUT (Connection timed out)
> {quote}
> and others loop on ptrace_attach (no such process) or restart_syscall
> (resuming interrupted call)
> even though this behavior has been solidly pinned to jobs timing out (which
> ends w/an aborted, not failed, build), i've seen it happen for failed builds
> as well. if i see any hanging processes from failed (not aborted) builds, i
> will investigate them and update this bug as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
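
The comment above notes that the SIGTERM sent on abort "isn't actually killing everything". One common way to make a timeout reach a whole process tree, rather than only the top-level process, is to start the build in its own process group and signal the group. The following is a minimal Python sketch of that idea under stated assumptions: it is not what Jenkins itself does, it assumes a POSIX system, and `run_with_group_timeout`, `_signal_group`, and the `grace_s` parameter are hypothetical names introduced only for illustration.

```python
import os
import signal
import subprocess


def _signal_group(pgid, sig):
    """Send sig to every process in group pgid; ignore an already-empty group."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        # No process is left in the group, so there is nothing to signal.
        pass


def run_with_group_timeout(cmd, timeout_s, grace_s=5.0):
    """Run cmd in its own process group and kill the whole group on timeout.

    Returns (returncode, timed_out). Hypothetical helper, for illustration only.
    """
    # start_new_session=True makes the child call setsid(), so the child and
    # anything it forks share a fresh process group whose id is the child's pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        pass

    pgid = proc.pid
    # Ask the whole group to exit first, so processes can clean up.
    _signal_group(pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        pass
    # Force-kill whatever is still in the group after the grace period.
    _signal_group(pgid, signal.SIGKILL)
    return proc.wait(), True
```

A call such as `run_with_group_timeout(["sh", "-c", "sleep 600 & sleep 600"], timeout_s=1)` signals both `sleep` processes, not only the shell. Two caveats of this sketch: a descendant that moves itself into a different session or process group escapes the group signal, and the group id could in principle be reused by an unrelated process once the original group is empty.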