[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated MAPREDUCE-5718:
----------------------------------------

    Attachment: mr-5718-0.patch

First-cut patch that deletes the startCommitFile if the commit is interrupted. 

However, if two AMs are running during a partition, this can lead to one AM 
deleting the startCommitFile created by the other. To avoid such races during a 
partition, we might need something a little more involved. 

How about adding a .host.pid suffix to the name of the commit file? Each AM 
would write its own. When a subsequent AM comes up and verifies the state of 
the commit from previous AMs, it could look for any such file. 
[~vinodkv], [~revans2] - thoughts? 
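
To make the suffix idea concrete, here is a rough sketch of the naming and 
lookup only. It is not what the attached patch does. The class name, the method 
names, and the COMMIT_STARTED base name are assumptions made for illustration, 
and the directory listing is passed in as plain strings rather than read from 
the staging directory. 

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the patch or of Hadoop. */
final class CommitMarkerNames {
    // Assumed base name for the start-of-commit marker file.
    static final String START_PREFIX = "COMMIT_STARTED";

    private CommitMarkerNames() {}

    /** Builds the per-AM marker name: base name plus a .host.pid suffix. */
    static String startMarkerFor(String host, long pid) {
        return START_PREFIX + "." + host + "." + pid;
    }

    /**
     * Returns the start markers in the listing that were written by other AMs,
     * i.e. every suffixed start marker except our own.
     */
    static List<String> markersFromOtherAMs(List<String> listing, String ownMarker) {
        List<String> others = new ArrayList<>();
        for (String name : listing) {
            if (name.startsWith(START_PREFIX + ".") && !name.equals(ownMarker)) {
                others.add(name);
            }
        }
        return others;
    }
}
```

With this naming, an AM whose commit is interrupted would delete only the file 
returned by startMarkerFor for its own host and pid, so it cannot remove a 
marker written by another AM. A new AM would treat a non-empty result from 
markersFromOtherAMs as a sign that a previous AM had started a commit. 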

> MR AM should tolerate RM restart/failover during commit
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-5718
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>              Labels: ha
>         Attachments: mr-5718-0.patch
>
>
> While testing RM HA, we ran into this issue where if the RM fails over while 
> an MR AM is in the middle of a commit, the subsequent AM gets spawned but 
> dies with a diagnostic message - "We crashed durring a commit". 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
