Sort job on 350 scale is consistently failing with latest MRV2 code 
--------------------------------------------------------------------

                 Key: MAPREDUCE-3656
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3656
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: applicationmaster, mrv2, resourcemanager
    Affects Versions: 0.23.1
            Reporter: Karam Singh
             Fix For: 0.23.1


With code checked out over the last two days, a Sort job on a 350-node 
cluster with 16800 maps and 680 reduces has been failing consistently, for 
about the last 6 runs.
When around 50% of the maps have completed, the job suddenly jumps to the 
failed state.
The NM log shows that the RM sent a Stop Container request to the NM for the 
AM container.
At INFO level, however, the RM log does not show why the RM is killing the AM 
when the job was not killed manually.
One thing the failed AM logs have in common is:
org.apache.hadoop.yarn.state.InvalidStateTransitonException
with different events and states in different logs.
For example, one log says:
{code}
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
TA_UPDATE at ASSIGNED 
{code}
whereas another log says:
{code}
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
JOB_COUNTER_UPDATE at ERROR
{code}
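
For readers unfamiliar with this kind of message, the following is a minimal 
sketch of how a table-driven state machine ends up reporting "Invalid event: X 
at Y". It is not the Hadoop state-machine code. Only the names TA_UPDATE and 
ASSIGNED come from the logs above; every other state, event, and function name 
here is hypothetical and exists only for illustration. The sketch shows the 
mechanism only (an event delivered in a state that has no transition 
registered for it) and says nothing about the root cause of this failure.

```python
from enum import Enum, auto


class InvalidStateTransition(Exception):
    """Raised when no transition is registered for a (state, event) pair."""


class State(Enum):
    UNASSIGNED = auto()  # hypothetical
    ASSIGNED = auto()    # name taken from the log excerpt
    RUNNING = auto()     # hypothetical


class Event(Enum):
    TA_ASSIGNED = auto()  # hypothetical
    TA_LAUNCHED = auto()  # hypothetical
    TA_UPDATE = auto()    # name taken from the log excerpt


# Hypothetical transition table: (current state, event) -> next state.
# In this sketch, TA_UPDATE is only legal once the attempt is RUNNING.
TRANSITIONS = {
    (State.UNASSIGNED, Event.TA_ASSIGNED): State.ASSIGNED,
    (State.ASSIGNED, Event.TA_LAUNCHED): State.RUNNING,
    (State.RUNNING, Event.TA_UPDATE): State.RUNNING,
}


def handle(state, event):
    """Return the next state, or raise if the pair is not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidStateTransition(
            f"Invalid event: {event.name} at {state.name}"
        ) from None
```

In this sketch, calling handle(State.ASSIGNED, Event.TA_UPDATE) raises 
InvalidStateTransition, because the table has no entry for that pair.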




        
