[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543390#comment-13543390
 ] 

Bikas Saha commented on MAPREDUCE-4832:
---------------------------------------

Independent of this change, this looks like a problem that should be solved in 
the platform rather than in the AM. For example, the NM could maintain an 
expire time on its containers and terminate them when that time is reached. 
The expire time is extended whenever the NM heartbeats with the RM. If the NM 
loses contact with the RM, or if the RM thinks the AM should no longer be 
running on that NM, the expire time is not extended. The RM starts retries 
only after the expire time has elapsed. The logic is similar but 
self-contained within the platform. AMs could do something similar for their 
own containers, which would provide automatic garbage collection when an AM 
crashes.
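
To make the idea concrete, here is a rough sketch of the NM-side bookkeeping. 
It is only an illustration under assumptions: ContainerLeaseTable and its 
methods are hypothetical names, not existing YARN classes, and the heartbeat 
response is assumed to carry the set of containers the RM still considers 
valid. Time is passed in explicitly so the logic can be exercised without a 
real clock.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of NM-side container leases; not part of Hadoop/YARN.
class ContainerLeaseTable {
    private final long leaseMs;
    // containerId -> absolute expire time in milliseconds
    private final Map<String, Long> expiryByContainer = new LinkedHashMap<>();

    ContainerLeaseTable(long leaseMs) {
        this.leaseMs = leaseMs;
    }

    // A newly launched container starts with a full lease.
    void launch(String containerId, long nowMs) {
        expiryByContainer.put(containerId, nowMs + leaseMs);
    }

    // Called only when a heartbeat to the RM succeeds. Containers the RM
    // still considers valid (an assumed field of the response) get their
    // lease extended; all others keep their old expire time.
    void onHeartbeatAck(Set<String> stillAllowed, long nowMs) {
        for (Map.Entry<String, Long> e : expiryByContainer.entrySet()) {
            if (stillAllowed.contains(e.getKey())) {
                e.setValue(nowMs + leaseMs);
            }
        }
    }

    // Called periodically. Removes every container whose lease has run out
    // and returns their IDs so the caller can terminate them.
    List<String> reapExpired(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it =
            expiryByContainer.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs >= e.getValue()) {
                expired.add(e.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

If heartbeats stop entirely, onHeartbeatAck is never called and every lease 
eventually runs out, so reapExpired returns all containers on the node. For 
this to close the split-brain window, the RM would have to wait at least one 
lease period before starting a replacement AM.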
                
> MR AM can get in a split brain situation
> ----------------------------------------
>
>                 Key: MAPREDUCE-4832
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Robert Joseph Evans
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4832.patch
>
>
> It is possible for a networking issue to happen where the RM thinks an AM has 
> gone down and launches a replacement, but the previous AM is still up and 
> running.  If the previous AM does not need any more resources from the RM it 
> could try to commit either tasks or jobs.  This could cause lots of problems 
> where the second AM finishes and tries to commit too.  This could result in 
> data corruption.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
