[ https://issues.apache.org/jira/browse/MAPREDUCE-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Seth updated MAPREDUCE-3656:
--------------------------------------

    Attachment: MR3656.txt

Race, exposed by a slow NM startContainer request. In this case, the startContainer call was processed by the NM and the container was launched, but startContainer took time to return on the AM. The launched container gets a task and starts sending status updates for a task which the AM considers not yet launched.
The patch waits for the AM CONTAINER_LAUNCHED event to be processed before assigning a task to the JVM (similar to the behaviour before MAPREDUCE-3569).

> Sort job on 350 scale is consistently failing with latest MRV2 code
> --------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3656
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3656
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.1
>            Reporter: Karam Singh
>            Assignee: Siddharth Seth
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR3656.txt
>
>
> With the code checked out in the last two days:
> A Sort job on 350 node scale with 16800 maps and 680 reduces has been
> consistently failing for around the last 6 runs.
> When around 50% of maps are completed, the job suddenly jumps to the failed state.
> On looking at the NM log, found that the RM sent a Stop Container Request to the NM
> for the AM container.
> But at INFO level, the RM log does not show why the RM is killing the AM when the
> job is not killed manually.
> One thing found in common in the failed AM logs is:
> org.apache.hadoop.yarn.state.InvalidStateTransitonException
> with different events and states.
> For example, one log says:
> {code}
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
> TA_UPDATE at ASSIGNED
> {code}
> Whereas another log says:
> {code}
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
> JOB_COUNTER_UPDATE at ERROR
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
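
For readers following the comment above, the ordering the patch enforces could be sketched roughly as follows. This is only an illustration of the idea (hand a task to a JVM only once the AM has processed the launch event), not the contents of MR3656.txt. The class `LaunchGate` and its methods are hypothetical names, and plain strings stand in for the real JVM and task attempt IDs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch, not the actual MapReduce AM code: a gate that
// refuses to hand a task to a JVM until the AM has processed the
// launch event for that JVM's container.
class LaunchGate {
    // JVM id -> task attempt id waiting to be picked up (assumed String ids).
    private final Map<String, String> pendingTaskByJvm = new HashMap<>();
    // JVMs whose launch event the AM has already processed.
    private final Set<String> launchedJvms = new HashSet<>();

    // Called when the AM decides which task a soon-to-start JVM will run.
    public synchronized void registerPendingTask(String jvmId, String taskAttemptId) {
        pendingTaskByJvm.put(jvmId, taskAttemptId);
    }

    // Called only after the AM has processed the launch event.
    public synchronized void onContainerLaunched(String jvmId) {
        launchedJvms.add(jvmId);
    }

    // Called when the JVM asks for work. Returns empty if the AM has not
    // yet seen the launch event, so the JVM is expected to ask again.
    public synchronized Optional<String> getTask(String jvmId) {
        if (!launchedJvms.contains(jvmId)) {
            return Optional.empty();
        }
        String task = pendingTaskByJvm.remove(jvmId);
        if (task != null) {
            // Each JVM gets its task at most once.
            launchedJvms.remove(jvmId);
        }
        return Optional.ofNullable(task);
    }
}
```

With this ordering, a JVM that starts before startContainer has returned on the AM simply receives no task and retries, so it cannot send status updates for a task attempt that the AM still considers unlaunched.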