[ https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543414#comment-13543414 ]

Jason Lowe commented on MAPREDUCE-4832:
---------------------------------------

bq. Seems like it's possible to avoid multiple AMs by tuning the 
AM_LIVENESS_INTERVAL (10 minutes by default) and MR_AM_TO_RM_WAIT_INTERVAL_MS 
(6 minutes by default). A new AM should only be started after the existing AM 
is done.

That *almost* solves the problem, but there are some corner cases left 
unsolved.  For example:

1) The AM is running on a node whose NM suddenly declares itself UNHEALTHY via 
its health-check script.
2) The RM removes the node from the active nodes and tells the NM to kill all 
containers running on that node.
3) A network cut occurs (or the NM crashes), so the NM never receives the 
notification to kill the containers and the AM is unable to reach the RM.
4) The RM now thinks all containers on that node are dead and proceeds to 
launch a new AM attempt.
5) For the next 6 minutes (or whatever the AM-to-RM expiry interval is) we have 
two app attempts running simultaneously.  If the old attempt can still reach 
HDFS or whatever else it needs to commit, we could end up committing twice.

bq. Could add a check to ensure the window interval is greater than the AM-RM 
heartbeat.

Actually that's not strictly necessary.  The code can function correctly even 
if the commit window is smaller than the heartbeat interval: job commit is 
woken up when a fresh heartbeat arrives, and task commit polls periodically to 
check whether a heartbeat has occurred recently.  It's not mandatory for the 
heartbeat interval to be smaller than the commit window for a commit to 
proceed, but if the heartbeat interval is the larger of the two, commit 
operations are more likely to stall waiting for a fresh heartbeat.
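
To make the task-commit side concrete, here's a rough sketch of the gating 
logic (names and structure are illustrative only, not the attached patch 
verbatim):

{code}
// Illustrative sketch only -- names are hypothetical, not the attached patch.
public class CommitWindowSketch {

  /** Minimal stand-in for the AM's heartbeat handler. */
  interface HeartbeatSource {
    long getLastHeartbeatTime();   // timestamp of the last successful AM->RM heartbeat
  }

  private final HeartbeatSource heartbeats;
  private final long commitWindowMs;   // how fresh the last heartbeat must be
  private final long pollIntervalMs;   // how often a pending commit re-checks

  public CommitWindowSketch(HeartbeatSource heartbeats,
                            long commitWindowMs, long pollIntervalMs) {
    this.heartbeats = heartbeats;
    this.commitWindowMs = commitWindowMs;
    this.pollIntervalMs = pollIntervalMs;
  }

  /** Block until the RM has been heard from within the commit window, then commit. */
  public void commitWhenSafe(Runnable commitAction) throws InterruptedException {
    while (true) {
      long sinceHeartbeat =
          System.currentTimeMillis() - heartbeats.getLastHeartbeatTime();
      if (sinceHeartbeat < commitWindowMs) {
        // The RM has acknowledged us recently, so we are very likely still the
        // live attempt; allow the commit to proceed.
        commitAction.run();
        return;
      }
      // Heartbeat is stale: hold off on the commit and poll again later.
      Thread.sleep(pollIntervalMs);
    }
  }
}
{code}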

bq. Does getClock() need to be part of the RMHeartbeatHandler. Looks like the 
AppContext can provide this

I put it in the interface so the caller could access the same clock used to 
timestamp the heartbeat, in case it differed from the AppContext clock or the 
caller didn't have access to the AppContext.  But that's probably never going 
to be a real concern, so I'll take it out.
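
For reference, the handler interface then shrinks to something like the 
following (again just a sketch of the shape; method names are for 
illustration):

{code}
// Sketch of the trimmed-down interface: getClock() removed since the
// AppContext can supply the clock.  Method names here are illustrative.
public interface RMHeartbeatHandler {
  /** Timestamp of the last successful AM->RM heartbeat. */
  long getLastHeartbeatTime();

  /** Run the callback when the next fresh heartbeat arrives
      (used to wake up job commit instead of polling). */
  void runOnNextHeartbeat(Runnable callback);
}
{code}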

And to address Bikas' comment:
bq. Independent of this change, this looks like a problem that needs to be 
solved in the platform than in the AM.

We might be able to close all the corner cases in the framework.  For example, 
the scenario above could be solved if the RM waited for confirmation from the 
NM that the containers had actually been killed before launching another 
attempt.  If the NM is unreachable before that confirmation arrives, the RM 
could wait out the AM expiry interval before launching a new attempt.  That 
could mean waiting much longer than necessary, but at least we'd know with 
confidence that two attempts aren't running simultaneously.
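
In rough form, the platform-side rule being proposed would look something like 
this (purely hypothetical; nothing like this exists in the RM today):

{code}
// Purely hypothetical sketch of the proposed RM-side rule; nothing like
// this exists in the current RM.
public class RelaunchPolicySketch {

  private final long amExpiryIntervalMs;   // AM liveness/expiry interval

  public RelaunchPolicySketch(long amExpiryIntervalMs) {
    this.amExpiryIntervalMs = amExpiryIntervalMs;
  }

  /**
   * How long the RM should wait before launching a replacement AM attempt.
   *
   * @param nmConfirmedKill true if the NM acknowledged killing the old AM's containers
   * @return delay in milliseconds before a new attempt may be launched
   */
  public long relaunchDelayMs(boolean nmConfirmedKill) {
    if (nmConfirmedKill) {
      // The old attempt is confirmed dead; safe to relaunch immediately.
      return 0;
    }
    // NM unreachable: wait out the full expiry interval so the old attempt
    // cannot still be alive and committing when the new one starts.
    return amExpiryIntervalMs;
  }
}
{code}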

                
> MR AM can get in a split brain situation
> ----------------------------------------
>
>                 Key: MAPREDUCE-4832
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Robert Joseph Evans
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4832.patch
>
>
> It is possible for a networking issue to happen where the RM thinks an AM has 
> gone down and launches a replacement, but the previous AM is still up and 
> running.  If the previous AM does not need any more resources from the RM it 
> could try to commit either tasks or jobs.  This could cause lots of problems 
> where the second AM finishes and tries to commit too.  This could result in 
> data corruption.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
