[jira] [Updated] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1185: Attachment: YARN-1185.3.patch FileSystemRMStateStore can leave partial files that prevent subsequent recovery --- Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-1185.1.patch, YARN-1185.2.patch, YARN-1185.3.patch FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1185: Attachment: YARN-1185.2.patch FileSystemRMStateStore can leave partial files that prevent subsequent recovery --- Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-1185.1.patch, YARN-1185.2.patch FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-1185: Attachment: YARN-1185.1.patch FileSystemRMStateStore can leave partial files that prevent subsequent recovery --- Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Omkar Vinit Joshi Attachments: YARN-1185.1.patch FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1185: -- Issue Type: Sub-task (was: Bug) Parent: YARN-128 FileSystemRMStateStore can leave partial files that prevent subsequent recovery --- Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1185: - Summary: FileSystemRMStateStore can leave partial files that prevent subsequent recovery (was: FileSystemRMStateStore doesn't use temporary files when writing data) bq. The RM will not start if there is anything wrong with the stored state. So it some write is partial/empty is will not start. The concern I have about that approach is it requires manual intervention from ops when there is a problem, and the current scheme can lead to that situation occurring because the RM can crash at arbitrary points. I think the RM should try to prevent that situation from occurring and/or have the ability to automatically recover from that situation if it does occur. The RM could skip the corrupted info and continue if the info is deemed not critical to the overall recovery process. Then we're only involving ops if the corruption is very serious. {quote} So we could do the following. Storing app data may continue to be optimistic and since thats the main workload we continue to do what we do today. Storing global data (mainly the security stuff) can change to be more atomic. {quote} That sounds reasonable, especially if the RM is more robust during recovery. I understand it's a tradeoff between reliability and performance, especially with the RPC overhead when talking to HDFS and the potentially high rate of state churn. Thanks for the informative discussion, [~bikassaha]! Updating the summary to better reflect the problem and not a particular solution. FileSystemRMStateStore can leave partial files that prevent subsequent recovery --- Key: YARN-1185 URL: https://issues.apache.org/jira/browse/YARN-1185 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.1.0-beta Reporter: Jason Lowe FileSystemRMStateStore writes directly to the destination file when storing state. However if the RM were to crash in the middle of the write, the recovery method could encounter a partially-written file and either outright crash during recovery or silently load incomplete state. To avoid this, the data should be written to a temporary file and renamed to the destination file afterwards. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira