[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631497#comment-14631497
 ] 

shane knapp commented on SPARK-8571:
------------------------------------

another one:
/usr/java/latest/bin/java -Xmx3g 
-Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.2/label/centos/target/tmp 
-Dspark.test.home=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.2/label/centos 
-Dspark.testing=1 -Dspark.port.maxRetries=100 -Dspark.ui.enabled=false 
-Dspark.ui.showConsoleProgress=false -Dspark.driver.allowMultipleContexts=true 
-Dspark.unsafe.exceptionOnMemoryLeak=true 
-Dsun.io.serialization.extendedDebugInfo=true -Dderby.system.durability=test 
-ea -Xmx3g -Xss4096k -XX:PermSize=128M -XX:MaxNewSize=256m -XX:MaxPermSize=1g

(i removed the classpath for brevity)

> spark streaming hanging processes upon build exit
> -------------------------------------------------
>
>                 Key: SPARK-8571
>                 URL: https://issues.apache.org/jira/browse/SPARK-8571
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, Streaming
>         Environment: centos 6.6 amplab build system
>            Reporter: shane knapp
>            Assignee: shane knapp
>            Priority: Minor
>              Labels: build, test
>
> over the past 3 months i've been noticing that there are occasionally hanging 
> processes on our build system workers after various spark builds have 
> finished.  these are all spark streaming processes.
> today i noticed a 3+ hour spark build that was timed out after 200 minutes 
> (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/),
>  and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on 
> amp-jenkins-worker-02.  after the timeout, it left the following process (and 
> all of its children) hanging.
> the process's CLI command was:
> {quote}
> [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714
> jenkins    1714  733  2.7 21342148 3642740 ?    Sl   07:52 1713:41 java 
> -Dderby.system.durability=test -Djava.awt.headless=true 
> -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp
>  -Dspark.driver.allowMultipleContexts=true 
> -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos
>  -Dspark.testing=1 -Dspark.ui.enabled=false 
> -Dspark.ui.showConsoleProgress=false 
> -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming
>  -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m 
> org.scalatest.tools.Runner -R 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes
>  
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes
>  -o -f 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt
>  -u 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/.
> {quote}
> stracing that process doesn't give us much:
> {quote}
> [root@amp-jenkins-worker-02 ~]# strace -p 1714
> Process 1714 attached - interrupt to quit
> futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL
> {quote}
> stracing its children gives us a *little* bit more...  some loop like this:
> {quote}
> <snip>
> futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
> futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, 
> ffffffff) = -1 ETIMEDOUT (Connection timed out)
> {quote}
> and others loop on ptrace_attach (no such process) or restart_syscall 
> (resuming interrupted call).
> even though this behavior has been solidly pinned to jobs timing out (which 
> ends w/an aborted, not failed, build), i've seen it happen for failed builds 
> as well.  if i see any hanging processes from failed (not aborted) builds, i 
> will investigate them and update this bug as well.
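
A minimal sketch of how leftover processes like the one above could be spotted after a build exits. It is not part of the AMPLab tooling: the function names `find_leftover_processes` and `list_processes` are hypothetical, and it assumes a Linux worker where `ps -eo pid,args` prints the PID and full command line. The filter simply keeps java processes whose command line mentions the Jenkins workspace path.

```python
import subprocess


def find_leftover_processes(ps_output, workspace_prefix):
    """Return (pid, command) pairs for java processes that mention workspace_prefix.

    ps_output is text in the shape produced by `ps -eo pid,args`:
    one process per line, PID first, then the full command line.
    """
    leftovers = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip the header line and anything that does not start with a PID.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, command = int(parts[0]), parts[1]
        if "java" in command and workspace_prefix in command:
            leftovers.append((pid, command))
    return leftovers


def list_processes():
    """Capture the current process table (assumes procps `ps` on Linux)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `find_leftover_processes(list_processes(), "/home/jenkins/workspace/")` once a build has finished would list candidates to investigate. Whether a match is really hung still has to be checked by hand, for example with strace as shown above.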



