[ 
https://issues.apache.org/jira/browse/YARN-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941988#comment-13941988
 ] 

Zhijie Shen commented on YARN-1776:
-----------------------------------

[~kkambatl], sure, please go ahead. [~ozawa], thanks for your input. I was 
thinking about the temp file approach, but I didn't think it could completely 
resolve the issue, and it makes loading the DT state much more complex in the 
failure case. If I understand correctly, the FileSystem interface methods do 
not guarantee atomicity (the exception is rename, which we previously 
considered atomic). Therefore, the RM can fail during and between each of the 
4 steps (IMO, steps 1 and 4 are not necessary, and after step 3 we need to 
rename the new DT file to the old file name), and loading the DT state needs 
to handle all of these cases. Another issue shows up if you look at the 
current FileSystemRMStateStore:
{code}
    writeFile(nodeCreatePath, os.toByteArray());
    fsOut.close();

    // store sequence number
    Path latestSequenceNumberPath = getNodePath(rmDTSecretManagerRoot,
          DELEGATION_TOKEN_SEQUENCE_NUMBER_PREFIX + latestSequenceNumber);
    LOG.info("Storing " + DELEGATION_TOKEN_SEQUENCE_NUMBER_PREFIX
        + latestSequenceNumber);
{code}
Storing a DT requires accessing two files. Even if we can ensure that 
accessing the DT file is atomic, the method can still fail at the point of the 
"store sequence number" comment, so that the DT file is updated but 
dtSequenceNumberPath isn't. Also, see updateApplicationStateInternal and 
updateApplicationAttemptStateInternal. They call updateFile:
{code}
  protected void updateFile(Path outputPath, byte[] data) throws Exception {
    if (fs.exists(outputPath)) {
      deleteFile(outputPath);
    }
    writeFile(outputPath, data);
  }
{code}
The RM can fail after deleting the file but before writing the new one, which 
leaves no copy of that state in the store. 
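
For comparison, here is a minimal sketch of the write-to-temp-then-rename 
pattern discussed above. It runs against the local file system through 
java.nio.file, not through the Hadoop FileSystem interface, and the class 
name, the ".tmp" suffix and the loadFile helper are hypothetical names chosen 
for illustration. It assumes the underlying file system supports an atomic 
move that replaces an existing target, which is exactly the property that 
would have to be verified for each FileSystem implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical local-file-system sketch; not part of FileSystemRMStateStore. */
class AtomicUpdateSketch {

  /** Hypothetical naming convention for the temporary copy. */
  static Path tempPathFor(Path outputPath) {
    return outputPath.resolveSibling(outputPath.getFileName() + ".tmp");
  }

  /** Replaces outputPath with data without deleting the old copy first. */
  static void updateFile(Path outputPath, byte[] data) throws IOException {
    Path tmp = tempPathFor(outputPath);
    // A crash during or after this write leaves the old file untouched.
    Files.write(tmp, data);
    // ATOMIC_MOVE throws AtomicMoveNotSupportedException when the file system
    // cannot move atomically. Whether an existing target is replaced is
    // implementation specific (POSIX rename semantics replace it).
    Files.move(tmp, outputPath, StandardCopyOption.ATOMIC_MOVE);
  }

  /** Load-time handling: a leftover temp file means an update was interrupted. */
  static byte[] loadFile(Path outputPath) throws IOException {
    // Discard the incomplete copy and fall back to the last complete state.
    Files.deleteIfExists(tempPathFor(outputPath));
    return Files.readAllBytes(outputPath);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("rmstate");
    Path state = dir.resolve("app_0001");
    updateFile(state, "v1".getBytes(StandardCharsets.UTF_8));
    updateFile(state, "v2".getBytes(StandardCharsets.UTF_8));
    System.out.println(new String(loadFile(state), StandardCharsets.UTF_8));
  }
}
```

Note that this only covers a single file. It does not address the two-file 
case above, where the DT file and the sequence number file can still get out 
of sync.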

I didn't closely follow the HA feature, but if RM failover relies on 
FSRMStateStore, we may expect some problems due to non-atomic behavior. 
Thoughts?

> renewDelegationToken should survive RM failover
> -----------------------------------------------
>
>                 Key: YARN-1776
>                 URL: https://issues.apache.org/jira/browse/YARN-1776
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>         Attachments: YARN-1776.1.patch
>
>
> When a delegation token is renewed, two RMStateStore operations happen: 
> 1) removing the old DT, and 2) storing the new DT. If the RM fails in 
> between, there will be a problem.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
