[ https://issues.apache.org/jira/browse/MAPREDUCE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karthik Kambatla updated MAPREDUCE-5718:
----------------------------------------
    Attachment: mr-5718-0.patch

First-cut patch that deletes the startCommitFile if the commit is interrupted. However, in case of two AMs running during a partition, this can lead to one AM deleting the startCommitFile created by another AM.

To avoid races in case of a partition, we might have to complicate this a little more. How about adding a .host.pid suffix to the name of the commit file? Each AM would write its own. When a subsequent AM comes up and verifies the state of commit from previous AMs, it would look for any?

[~vinodkv], [~revans2] - thoughts?

> MR AM should tolerate RM restart/failover during commit
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-5718
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5718
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>              Labels: ha
>         Attachments: mr-5718-0.patch
>
>
> While testing RM HA, we ran into this issue where if the RM fails over while
> an MR AM is in the middle of a commit, the subsequent AM gets spawned but
> dies with a diagnostic message - "We crashed durring a commit".

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
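
For illustration only, the .host.pid suffix idea from the comment above could be sketched roughly as follows. This is not the attached patch: the class name, the COMMIT_STARTED prefix, and all method names are hypothetical, and the local filesystem (java.nio.file) stands in for the staging directory that a real AM would access through Hadoop's filesystem API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PerAmCommitMarker {
    // Hypothetical marker prefix; the real commit file name is not assumed here.
    static final String PREFIX = "COMMIT_STARTED";

    /** Builds a per-AM marker name, e.g. COMMIT_STARTED.hostA.101 (assumed layout). */
    static String markerName(String host, long pid) {
        return PREFIX + "." + host + "." + pid;
    }

    /** Creates this AM's own marker; throws if that exact marker already exists. */
    static Path startCommit(Path dir, String host, long pid) throws IOException {
        return Files.createFile(dir.resolve(markerName(host, pid)));
    }

    /** Deletes only this AM's marker, so a marker written by another AM is never removed. */
    static boolean abortCommit(Path dir, String host, long pid) throws IOException {
        return Files.deleteIfExists(dir.resolve(markerName(host, pid)));
    }

    /** Lists markers left by any AM; a newly started AM would check these before committing. */
    static List<String> pendingCommits(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, PREFIX + ".*")) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }
}
```

With this layout, an AM whose commit is interrupted removes only the file carrying its own host and pid, which avoids the race described above. What a new AM should do when it finds markers from earlier AMs is the open question in the comment and is deliberately left out of the sketch.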