[ https://issues.apache.org/jira/browse/TEZ-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989447#comment-15989447 ]
TezQA commented on TEZ-3696:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12865607/TEZ-3696.003.patch
  against master revision 247719d.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2404//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2404//console

This message is automatically generated.

> Jobs can hang when both concurrency and speculation are enabled
> ---------------------------------------------------------------
>
>                 Key: TEZ-3696
>                 URL: https://issues.apache.org/jira/browse/TEZ-3696
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>         Attachments: TEZ-3696.001.patch, TEZ-3696.002.patch, TEZ-3696.003.patch
>
>
> We can reproduce the hung job by doing the following:
> 1. Run a sleep job with a concurrency of 1, speculation enabled, and 3 tasks.
> {noformat}
> HADOOP_CLASSPATH="$TEZ_HOME/*:$TEZ_HOME/lib/*:$TEZ_CONF_DIR" yarn jar $TEZ_HOME/tez-tests-*.jar mrrsleep -Dtez.am.vertex.max-task-concurrency=1 -Dtez.am.speculation.enabled=true -Dtez.task.timeout-ms=60000 -m 3 -mt 60000 -ir 0 -irt 0 -r 0 -rt 0
> {noformat}
> 2. Let the 1st task run to completion and then stop the 2nd task so that a
> speculative attempt is scheduled. Once the speculative attempt is scheduled
> for the 2nd task, continue the original attempt and let it complete.
> {noformat}
> kill -STOP <pid>
> // wait a few seconds for a speculative attempt to kick off
> kill -CONT <pid>
> {noformat}
> 3. Kill the 3rd task, which will create a 2nd attempt.
> {noformat}
> kill -9 <pid>
> {noformat}
> 4. The next thing to be drawn off of the queue will be the speculative
> attempt of the 2nd task. However, that task has already completed, so the
> speculative attempt will just sit in its final state and the job will hang.
>
> Basically, for the failure to happen, the number of speculative attempts that
> are scheduled but have not yet run has to be >= the concurrency of the job,
> and there has to be at least 1 task failure.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
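
The reproduction steps in the quoted description can be replayed with a small toy model. The sketch below is not Tez code: `run_sleep_job`, the attempt labels (`orig`, `spec`, `retry`) and the `skip_stale` guard are all hypothetical names made up for illustration, and the guard is not claimed to be what the attached patches do. The model assumes only what the description states: attempts are drawn off a queue while fewer than `concurrency` attempts hold a slot, and a speculative attempt drawn after its task has already completed sits in its final state without ever giving the slot back.

```python
from collections import deque


def run_sleep_job(concurrency=1, skip_stale=False):
    """Replay the 3-task scenario from the description in a toy scheduler."""
    queue = deque([(1, "orig"), (2, "orig"), (3, "orig")])  # attempts waiting for a slot
    running = []        # attempts that currently hold a concurrency slot
    done = set()        # tasks that already have a successful attempt
    speculated = False

    while True:
        # Draw attempts off the queue until every slot is taken.
        while queue and len(running) < concurrency:
            attempt = queue.popleft()
            if skip_stale and attempt[0] in done:
                continue  # hypothetical guard: drop attempts of finished tasks
            running.append(attempt)

        # Only attempts whose task is still unfinished can make progress.
        live = [a for a in running if a[0] not in done]
        if not live:
            break
        attempt = live[0]
        task, kind = attempt

        if task == 2 and kind == "orig" and not speculated:
            # Step 2: the stalled original triggers a speculative attempt,
            # then the original is continued and completes (handled below).
            queue.append((2, "spec"))
            speculated = True

        running.remove(attempt)  # the attempt ends and frees its slot
        if task == 3 and kind == "orig":
            queue.append((3, "retry"))  # step 3: kill -9 creates a 2nd attempt
        else:
            done.add(task)

    return {
        "hung": len(done) < 3,
        "done": sorted(done),
        "holding_slot": list(running),
        "waiting": list(queue),
    }
```

With `concurrency=1` and `skip_stale=False` the model ends with tasks 1 and 2 done, the speculative attempt of task 2 holding the only slot, and the retry of task 3 still waiting, which is the hang described in step 4. With `skip_stale=True` all three tasks finish.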