[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543727#comment-13543727
 ] 

Siddharth Seth commented on MAPREDUCE-4832:
-------------------------------------------

bq. AM is running on a node whose NM suddenly declares itself UNHEALTHY via 
health-check script
Right, there are multiple ways in which an AM may time out, and this specific 
case can lead to multiple AMs running at once, so a fix is required.
I'm +1 for the updated patch.
                
> MR AM can get in a split brain situation
> ----------------------------------------
>
>                 Key: MAPREDUCE-4832
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Robert Joseph Evans
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4832.patch, MAPREDUCE-4832.patch
>
>
> It is possible for a networking issue to happen where the RM thinks an AM has 
> gone down and launches a replacement, but the previous AM is still up and 
> running.  If the previous AM does not need any more resources from the RM, it 
> could still try to commit either tasks or jobs.  If the second AM then finishes 
> and tries to commit too, the two commits can conflict, which could result in 
> data corruption.
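
One way to narrow the race described above is to let an AM commit only while it 
has recently heard from the RM. The following is a minimal sketch of that idea, 
not the attached patch: the class CommitWindowGuard, its methods, and the 
10-second window are hypothetical names and values chosen for illustration. 
A guard like this shrinks the window for a double commit but does not by itself 
rule one out.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical guard (not from the MAPREDUCE-4832 patch): a commit is allowed
 * only if the last successful RM heartbeat is recent enough.
 */
public class CommitWindowGuard {
    // Maximum age, in milliseconds, of the last heartbeat for a commit to proceed.
    private final long windowMs;
    // Timestamp of the last successful heartbeat; -1 means none seen yet.
    private final AtomicLong lastHeartbeatMs = new AtomicLong(-1L);

    public CommitWindowGuard(long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        this.windowMs = windowMs;
    }

    /** Called whenever a heartbeat to the RM succeeds. */
    public void recordHeartbeat(long nowMs) {
        lastHeartbeatMs.set(nowMs);
    }

    /** Returns true only if a heartbeat succeeded within the last windowMs. */
    public boolean mayCommit(long nowMs) {
        long last = lastHeartbeatMs.get();
        return last >= 0 && nowMs - last <= windowMs;
    }

    public static void main(String[] args) {
        CommitWindowGuard guard = new CommitWindowGuard(10_000L);
        guard.recordHeartbeat(System.currentTimeMillis());
        System.out.println("may commit: " + guard.mayCommit(System.currentTimeMillis()));
    }
}
```

An AM cut off from the RM would stop passing mayCommit once the window expires, 
so the window should be shorter than the time the RM waits before it launches a 
replacement AM.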

