[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800137#comment-13800137 ]

Hudson commented on YARN-1185:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1584 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1584/])
YARN-1185. Fixed FileSystemRMStateStore to not leave partial files that prevent subsequent ResourceManager recovery. Contributed by Omkar Vinit Joshi. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1533803)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStoreZKClientConnections.java

> FileSystemRMStateStore can leave partial files that prevent subsequent recovery
> ---
>
> Key: YARN-1185
> URL: https://issues.apache.org/jira/browse/YARN-1185
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.1.0-beta
> Reporter: Jason Lowe
> Assignee: Omkar Vinit Joshi
> Fix For: 2.3.0
>
> Attachments: YARN-1185.1.patch, YARN-1185.2.patch, YARN-1185.3.patch
>
>
> FileSystemRMStateStore writes directly to the destination file when storing
> state. However if the RM were to crash in the middle of the write, the
> recovery method could encounter a partially-written file and either outright
> crash during recovery or silently load incomplete state.
> To avoid this, the data should be written to a temporary file and renamed to
> the destination file afterwards.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800126#comment-13800126 ]

Hudson commented on YARN-1185:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1558 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1558/])
YARN-1185. Fixed FileSystemRMStateStore to not leave partial files that prevent subsequent ResourceManager recovery. Contributed by Omkar Vinit Joshi. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1533803)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStoreZKClientConnections.java
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800121#comment-13800121 ]

Hudson commented on YARN-1185:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #368 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/368/])
YARN-1185. Fixed FileSystemRMStateStore to not leave partial files that prevent subsequent ResourceManager recovery. Contributed by Omkar Vinit Joshi. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1533803)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStoreZKClientConnections.java
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799966#comment-13799966 ]

Hudson commented on YARN-1185:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #4633 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4633/])
YARN-1185. Fixed FileSystemRMStateStore to not leave partial files that prevent subsequent ResourceManager recovery. Contributed by Omkar Vinit Joshi. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1533803)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStoreZKClientConnections.java
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799721#comment-13799721 ]

Hadoop QA commented on YARN-1185:
-

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12609245/YARN-1185.3.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2226//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2226//console

This message is automatically generated.
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799707#comment-13799707 ]

Vinod Kumar Vavilapalli commented on YARN-1185:
---

Patch looks good to me. Can you address the test-issue?
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798727#comment-13798727 ]

Hadoop QA commented on YARN-1185:
-

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12609080/YARN-1185.2.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.recovery.TestRMStateStore

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2216//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2216//console

This message is automatically generated.
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798711#comment-13798711 ]

Omkar Vinit Joshi commented on YARN-1185:
-

Thanks [~vinodkv] and [~jianhe].

bq. Can you please rip apart TestRMStateStore into two tests (files) - TestFileSystemRMStateStore and TestZKRMStateStore but use common code?
Done.

bq. Also, to indicate corruption, instead of .tmp file, we can try a state-store write with a partial record and try to recover from that.
I am already doing this.

bq. The test case should also assert at the end that the corrupted application/attempt is not loaded back into RMState and doesn't exist in the FileSystem
Done.

Attaching a new patch.
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797558#comment-13797558 ]

Jian He commented on YARN-1185:
---

The test case should also assert at the end that the corrupted application/attempt is not loaded back into RMState and doesn't exist in the FileSystem.
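The check requested in the comment above could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses the local file system through java.nio rather than Hadoop's FileSystem API, it assumes an interrupted write leaves behind an entry with a ".tmp" suffix, and the class name, the loadState helper, and the file names are all hypothetical. It is not the test from any of the attached patches.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: local java.nio files stand in for the state store,
// and every name in this class is hypothetical.
final class PartialRecordRecovery {

    // Loads every complete record in the directory. Entries ending in ".tmp"
    // are treated as interrupted writes: they are deleted and left out of
    // the returned state.
    static Map<String, String> loadState(Path dir) throws IOException {
        Map<String, String> state = new TreeMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (name.endsWith(".tmp")) {
                    Files.delete(entry);
                    continue;
                }
                byte[] bytes = Files.readAllBytes(entry);
                state.put(name, new String(bytes, StandardCharsets.UTF_8));
            }
        }
        return state;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmstate");
        Files.write(dir.resolve("application_0001"),
                "complete".getBytes(StandardCharsets.UTF_8));

        // Simulate a crash between writing the record and renaming it.
        Path partial = dir.resolve("application_0002.tmp");
        Files.write(partial, "par".getBytes(StandardCharsets.UTF_8));

        Map<String, String> state = loadState(dir);

        // The partial record must not be loaded back...
        if (state.size() != 1 || !state.containsKey("application_0001")) {
            throw new IllegalStateException("partial record was loaded: " + state);
        }
        // ...and must no longer exist on the file system.
        if (Files.exists(partial)) {
            throw new IllegalStateException("partial record still on disk");
        }
        System.out.println("recovered: " + state.keySet());
    }
}
```

Checking both the returned state and the directory contents covers the two halves of the request: nothing corrupt is loaded, and nothing corrupt is left behind for the next recovery.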
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795511#comment-13795511 ]

Hadoop QA commented on YARN-1185:
-

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12608545/YARN-1185.1.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/2178//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2178//console

This message is automatically generated.
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795461#comment-13795461 ]

Omkar Vinit Joshi commented on YARN-1185:
-

I think it is fair to assume that the rename operation is atomic, so we can split the existing writeFile operation into two calls:
* First, write the data to a .tmp file.
* Then rename it to the actual file.

Similarly, when loading the state, we will discard any file we encounter with a ".tmp" extension. Attaching a patch that does this. Let me know your thoughts.
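The two-step write proposed above can be sketched as follows. This is a minimal illustration, assuming a local file system accessed through java.nio rather than Hadoop's FileSystem API; the class name, the TMP_SUFFIX constant, and the file names are hypothetical, and the code is not taken from the attached patch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative only: uses the local file system via java.nio, not Hadoop's
// FileSystem API. The class, constant, and file names are hypothetical.
final class AtomicStateWrite {

    static final String TMP_SUFFIX = ".tmp";

    // Step 1: write the complete record to "<name>.tmp".
    // Step 2: rename the temporary file to the destination name.
    // A crash during step 1 leaves only the .tmp file behind, so a reader
    // never sees a partially written destination file.
    static void writeFile(Path destination, byte[] data) throws IOException {
        Path tmp = destination.resolveSibling(destination.getFileName() + TMP_SUFFIX);
        Files.write(tmp, data);
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException instead of
        // silently falling back to a copy when the file system cannot
        // rename atomically. Whether an existing destination is replaced
        // is implementation specific; here the destination is new.
        Files.move(tmp, destination, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmstate");
        Path app = dir.resolve("application_0001");
        writeFile(app, "state".getBytes(StandardCharsets.UTF_8));

        System.out.println("destination exists: " + Files.exists(app));
        System.out.println("tmp file left over: "
                + Files.exists(dir.resolve("application_0001" + TMP_SUFFIX)));
    }
}
```

The sketch relies on the same assumption stated in the comment: that rename is atomic on the underlying file system. Where that does not hold, this approach alone would not give the guarantee.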
[jira] [Commented] (YARN-1185) FileSystemRMStateStore can leave partial files that prevent subsequent recovery
[ https://issues.apache.org/jira/browse/YARN-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782363#comment-13782363 ]

Arpit Gupta commented on YARN-1185:
---

Here is the stack trace from the RM when it tries to recover partially written data

{code}
2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler (CapacityScheduler.java:parseQueue(408)) - Initialized queue: default: capacity=1.0, absoluteCapacity=1.0, usedResources=usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0, numContainers=0
2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler (CapacityScheduler.java:parseQueue(408)) - Initialized queue: root: numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0, usedResources=usedCapacity=0.0, numApps=0, numContainers=0
2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler (CapacityScheduler.java:initializeQueues(306)) - Initialized root queue root: numChildQueue= 1, capacity=1.0, absoluteCapacity=1.0, usedResources=usedCapacity=0.0, numApps=0, numContainers=0
2013-09-30 09:12:09,206 INFO capacity.CapacityScheduler (CapacityScheduler.java:reinitialize(270)) - Initialized CapacityScheduler with calculator=class org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, minimumAllocation=<>, maximumAllocation=<>
2013-09-30 09:12:09,240 INFO event.AsyncDispatcher (AsyncDispatcher.java:register(157)) - Registering class org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.RMAppManager
2013-09-30 09:12:09,250 INFO event.AsyncDispatcher (AsyncDispatcher.java:register(157)) - Registering class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType for class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher
2013-09-30 09:12:09,252 INFO resourcemanager.RMNMInfo (RMNMInfo.java:(63)) - Registered RMNMInfo MBean
2013-09-30 09:12:09,253 INFO util.HostsFileReader (HostsFileReader.java:refresh(84)) - Refreshing hosts (include/exclude) list
2013-09-30 09:12:09,278 INFO security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(843)) - Login successful for user rm/hostname@realm using keytab file /etc/security/keytabs/rm.service.keytab
2013-09-30 09:12:09,278 INFO security.RMContainerTokenSecretManager (RMContainerTokenSecretManager.java:rollMasterKey(103)) - Rolling master-key for container-tokens
2013-09-30 09:12:09,279 INFO security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:rollMasterKey(107)) - Rolling master-key for amrm-tokens
2013-09-30 09:12:09,281 INFO security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:rollMasterKey(97)) - Rolling master-key for nm-tokens
2013-09-30 09:12:10,196 INFO recovery.FileSystemRMStateStore (FileSystemRMStateStore.java:loadRMAppState(131)) - Loading application from node: application_1380531989689_0002
2013-09-30 09:12:10,217 INFO recovery.FileSystemRMStateStore (FileSystemRMStateStore.java:loadRMAppState(131)) - Loading application from node: application_1380531989689_0003
2013-09-30 09:12:10,232 INFO security.RMDelegationTokenSecretManager (RMDelegationTokenSecretManager.java:recover(181)) - recovering RMDelegationTokenSecretManager.
2013-09-30 09:12:10,234 INFO resourcemanager.RMAppManager (RMAppManager.java:recover(329)) - Recovering 2 applications
2013-09-30 09:12:10,234 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(640)) - Failed to load/recover state
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:332)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:842)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:636)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:855)
2013-09-30 09:12:10,236 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2013-09-30 09:17:20,144 INFO resourcemanager.ResourceManager (StringUtils.java:startupShutdownMessage(601)) - STARTUP_MSG:
{code}

> FileSystemRMStateStore can leave partial files that prevent subsequent recovery
> ---
>
> Key: YARN-1185
> URL: https://issues.apache.org/jira/browse/YARN-1185
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.1.0-beta
> Reporter: Jason Lowe
>
> FileSystemRMStateStore writes directly to the destination file when storing
> state. However if the RM were to crash in the middle of the write, the
> recovery method could encounter a partial