[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622924#comment-14622924
 ] 

shane knapp commented on SPARK-8571:
------------------------------------

ok, changes have been made to the build configs...  i'll keep an eye on these 
and make sure they're working as intended.  i'll mark this resolved once i'm 
certain we're in good shape.

> spark streaming hanging processes upon build exit
> -------------------------------------------------
>
>                 Key: SPARK-8571
>                 URL: https://issues.apache.org/jira/browse/SPARK-8571
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, Streaming
>         Environment: centos 6.6 amplab build system
>            Reporter: shane knapp
>            Assignee: shane knapp
>            Priority: Minor
>              Labels: build, test
>
> over the past 3 months i've been noticing that there are occasionally hanging 
> processes on our build system workers after various spark builds have 
> finished.  these are all spark streaming processes.
> today i noticed a 3+ hour spark build that was timed out after 200 minutes 
> (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/),
>  and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on 
> amp-jenkins-worker-02.  after the timeout, it left the following process (and 
> all of its children) hanging.
> the process' CLI command was:
> {quote}
> [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714
> jenkins    1714  733  2.7 21342148 3642740 ?    Sl   07:52 1713:41 java 
> -Dderby.system.durability=test -Djava.awt.headless=true 
> -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp
>  -Dspark.driver.allowMultipleContexts=true 
> -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos
>  -Dspark.testing=1 -Dspark.ui.enabled=false 
> -Dspark.ui.showConsoleProgress=false 
> -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming
>  -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m 
> org.scalatest.tools.Runner -R 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes
>  
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes
>  -o -f 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt
>  -u 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/.
> {quote}
> stracing that process doesn't give us much:
> {quote}
> [root@amp-jenkins-worker-02 ~]# strace -p 1714
> Process 1714 attached - interrupt to quit
> futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL
> {quote}
> stracing its children gives us a *little* bit more...  some loop like this:
> {quote}
> <snip>
> futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
> futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, 
> ffffffff) = -1 ETIMEDOUT (Connection timed out)
> {quote}
> and others loop on ptrace_attach (no such process) or restart_syscall 
> (resuming interrupted call)
> even though this behavior has been solidly pinned to jobs timing out (which 
> ends w/an aborted, not failed, build), i've seen it happen for failed builds 
> as well.  if i see any hanging processes from failed (not aborted) builds, i 
> will investigate them and update this bug as well.
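
for illustration only: one generic way to keep child processes from outliving a timed-out build is to start the build in its own process group and kill the whole group when the timeout fires. the sketch below is an assumption, not what the jenkins configs on the amplab workers actually do; run_with_group_timeout is a hypothetical helper name, and it assumes a POSIX system. it will not catch processes that move themselves into a different session or process group.

```python
import os
import signal
import subprocess
import sys


def run_with_group_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it was killed on timeout.

    Hypothetical helper, not part of Jenkins or Spark.
    """
    # start_new_session=True makes the child the leader of a new session and
    # process group, so processes it forks stay in that group by default.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        pass

    try:
        # The group id equals the leader's pid, so this signals the child
        # and every descendant that is still in its process group.
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        # The whole group exited between the timeout and the kill.
        pass
    proc.wait()  # reap the leader so it does not linger as a zombie
    return None


if __name__ == "__main__":
    # Example: a command that would sleep for an hour is killed after 2 seconds.
    result = run_with_group_timeout(
        [sys.executable, "-c", "import time; time.sleep(3600)"], 2
    )
    print("timed out" if result is None else "exit code %d" % result)
```

SIGKILL is used here to keep the sketch short; a real wrapper would probably send SIGTERM first and only escalate after a grace period.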



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
