[ https://issues.apache.org/jira/browse/YARN-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941988#comment-13941988 ]
Zhijie Shen commented on YARN-1776:
-----------------------------------

[~kkambatl], sure, please go ahead.

[~ozawa], thanks for your input. I was thinking about the temp file approach, but I don't think it can completely resolve the issue, and it makes loading the DT state much more complex in the failure case. If I understand correctly, FileSystem interface methods are not guaranteed to be atomic (the exception being rename, which we previously considered atomic). Therefore, RM can fail during and between each of the 4 steps (IMO, 1 and 4 are not necessary, and after 3 we need to rename the new DT file to the old file name), and loading the DT state needs to handle all of these cases.

Another issue shows up in the current FileSystemRMStateStore:

{code}
writeFile(nodeCreatePath, os.toByteArray());
fsOut.close();

// store sequence number
Path latestSequenceNumberPath = getNodePath(rmDTSecretManagerRoot,
    DELEGATION_TOKEN_SEQUENCE_NUMBER_PREFIX + latestSequenceNumber);
LOG.info("Storing " + DELEGATION_TOKEN_SEQUENCE_NUMBER_PREFIX + latestSequenceNumber);
{code}

Storing a DT requires accessing two files. Even if we can ensure that accessing the DT file is atomic, the method can still fail at the comment's place, in which case the DT file is updated but dtSequenceNumberPath isn't.

Also, see updateApplicationStateInternal and updateApplicationAttemptStateInternal. They call updateFile:

{code}
protected void updateFile(Path outputPath, byte[] data) throws Exception {
  if (fs.exists(outputPath)) {
    deleteFile(outputPath);
  }
  writeFile(outputPath, data);
}
{code}

RM can fail after deleting the file and before writing the new one, which leaves no file at all. I didn't closely follow the HA feature, but if RM failover relies on FSRMStateStore, we may expect some problems due to this non-atomic behavior.

Thoughts?
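For illustration only, here is a minimal sketch of what a write-temp-then-rename variant of updateFile could look like. It uses java.nio.file on a local file system rather than Hadoop's FileSystem, so it says nothing about rename atomicity on HDFS. The names AtomicUpdateSketch, tmpPathFor, updateFileViaRename and discardStaleTemp are hypothetical and not part of FileSystemRMStateStore. It only closes the delete-then-write gap for a single file. It does not help with the two-file case (DT file plus sequence number file) described above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class AtomicUpdateSketch {

    // Hypothetical naming convention: the temp file sits next to the target.
    static Path tmpPathFor(Path outputPath) {
        return outputPath.resolveSibling(outputPath.getFileName().toString() + ".tmp");
    }

    // Write the new bytes to the temp file, then rename it over the target.
    // A crash before the rename leaves the old target untouched.
    static void updateFileViaRename(Path outputPath, byte[] data) throws IOException {
        Path tmp = tmpPathFor(outputPath);
        Files.write(tmp, data);
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException if the file system
        // cannot rename atomically. Whether an existing target is replaced is
        // implementation specific; POSIX file systems replace it.
        Files.move(tmp, outputPath, StandardCopyOption.ATOMIC_MOVE);
    }

    // Recovery step to run while loading state: a leftover temp file means
    // an update did not complete, so it is dropped and the old target is kept.
    static void discardStaleTemp(Path outputPath) throws IOException {
        Files.deleteIfExists(tmpPathFor(outputPath));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmstore-sketch");
        Path target = dir.resolve("app_state");

        updateFileViaRename(target, "v1".getBytes(StandardCharsets.UTF_8));
        updateFileViaRename(target, "v2".getBytes(StandardCharsets.UTF_8));

        // Simulate a crash after the temp write but before the rename.
        Files.write(tmpPathFor(target), "v3-partial".getBytes(StandardCharsets.UTF_8));
        discardStaleTemp(target);

        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

Even in this simplified form, the load path has to know about the temp file and clean it up, which is the extra recovery complexity mentioned above.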
> renewDelegationToken should survive RM failover
> -----------------------------------------------
>
>                 Key: YARN-1776
>                 URL: https://issues.apache.org/jira/browse/YARN-1776
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>         Attachments: YARN-1776.1.patch
>
>
> When a delegation token is renewed, two RMStateStore operations happen:
> 1) removing the old DT, and 2) storing the new DT. If RM fails in between,
> there would be a problem.

--
This message was sent by Atlassian JIRA
(v6.2#6252)