[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340307#comment-14340307 ]

Tsuyoshi Ozawa commented on YARN-2820:
--

+1, committing this shortly.

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> --
>
> Key: YARN-2820
> URL: https://issues.apache.org/jira/browse/YARN-2820
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.5.0, 2.6.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch, YARN-2820.006.patch, YARN-2820.007.patch, YARN-2820.007.patch
>
>
> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw the following IOException cause the RM to shut down.
> {code}
> 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01
> 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
> 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01
> 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
>     at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
>     at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
>     at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>     at java.lang.Thread.run(Thread.java:744)
> {code}
> As discussed at YARN-1778, TestFSRMStateStore failure is also due to IOException in storeApplicationStateInternal.
> Stack trace from TestFSRMStateStore failure:
> {code}
> 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started
>     at org.apac
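The retry behaviour this issue asks for can be illustrated with a small, generic helper. This is only a sketch under stated assumptions, not the code from any of the attached patches: the class name FsRetrySketch, the IoAction interface, and the maxRetries and retryIntervalMs parameters are all hypothetical names chosen for the example.

```java
import java.io.IOException;

/** Hypothetical helper for illustration; not taken from the YARN-2820 patches. */
final class FsRetrySketch {

  /** An operation that may fail with an IOException, such as a file write. */
  interface IoAction<T> {
    T run() throws IOException;
  }

  /**
   * Runs the action and retries it up to maxRetries times after an
   * IOException, sleeping retryIntervalMs between attempts. Once the
   * retries are exhausted, the last IOException is rethrown.
   */
  static <T> T runWithRetries(IoAction<T> action, int maxRetries,
      long retryIntervalMs) throws IOException, InterruptedException {
    int retry = 0;
    while (true) {
      try {
        return action.run();
      } catch (IOException e) {
        if (++retry > maxRetries) {
          throw e; // give up; the caller decides whether this is fatal
        }
        Thread.sleep(retryIntervalMs);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    final int[] calls = {0};
    // Simulated store operation: fails twice, then succeeds.
    String result = runWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new IOException("simulated transient failure");
      }
      return "stored";
    }, 3, 10L);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

With a wrapper of this kind, a transient failure such as the "not enough replicas" error in the log above would only become fatal after the configured number of retries has been used up.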
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340304#comment-14340304 ]

Tsuyoshi Ozawa commented on YARN-2820:
--

[~zxu] My bad. Closeable should be idempotent, so it's OK.
http://docs.oracle.com/javase/7/docs/api/java/lang/AutoCloseable.html

Please ignore the above comment.
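The java.io.Closeable contract states that closing an already closed stream has no effect. A minimal sketch of a class that honours this contract is shown here; IdempotentResource is a hypothetical class written for the example and says nothing about how any particular Hadoop FileSystem implements close().

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical resource used only to illustrate an idempotent close(). */
final class IdempotentResource implements Closeable {
  private boolean closed = false;
  private int releaseCount = 0;

  @Override
  public synchronized void close() throws IOException {
    if (closed) {
      return; // second and later calls have no effect
    }
    closed = true;
    releaseCount++; // stands in for releasing the underlying resource
  }

  /** Number of times the underlying resource was actually released. */
  synchronized int releaseCount() {
    return releaseCount;
  }

  public static void main(String[] args) throws IOException {
    IdempotentResource resource = new IdempotentResource();
    resource.close();
    resource.close(); // safe: the resource is released only once
    System.out.println("released " + resource.releaseCount() + " time(s)");
  }
}
```

In this sketch the closed flag is set before the release step, so a close that fails part-way would not be attempted again. Whether that ordering is appropriate for a close that is wrapped in retries is a design choice the sketch does not settle.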
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340265#comment-14340265 ]

Tsuyoshi Ozawa commented on YARN-2820:
--

[~zxu] Thank you for updating. I have rethought closeInternal(). If we call fs.close() twice or more, it could unexpectedly close another file descriptor, which could lead to unexpected behaviour. We should remove closeWithRetries and call fs.close() directly in closeInternal() to avoid the problem. What do you think?

Thank you for dealing with the iterative reviews.
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340148#comment-14340148 ]

Hadoop QA commented on YARN-2820:
--

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12701354/YARN-2820.007.patch
against trunk revision 4f75b15.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6780//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6780//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6780//console

This message is automatically generated.
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340023#comment-14340023 ]

Hadoop QA commented on YARN-2820:
--

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12701314/YARN-2820.007.patch
against trunk revision 48c7ee7.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:red}-1 javac{color}. The applied patch generated 1151 javac compiler warnings (more than the trunk's current 205 warnings).
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 47 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/6778//artifact/patchprocess/diffJavadocWarnings.txt for details.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
    org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
    org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6778//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6778//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6778//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6778//console

This message is automatically generated.
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339954#comment-14339954 ]

zhihai xu commented on YARN-2820:
--

[~ozawa], Cool, I just learned this new syntax.
I uploaded a new patch, YARN-2820.007.patch, which uses the try-with-resources statement.
Please review it.

Thanks,
zhihai
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339910#comment-14339910 ]

Tsuyoshi Ozawa commented on YARN-2820:
--------------------------------------

[~zxu] The try-with-resources statement was introduced in JDK 7 for instances that implement Closeable (or, more generally, AutoCloseable): http://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html
{code}
try (ByteArrayInputStream is = new ByteArrayInputStream(childData);
     DataInputStream fsIn = new DataInputStream(is)) {
  // processing something here
}
// is and fsIn are closed automatically after the block.
{code}
It's useful because we no longer need a finally block with a null check.

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
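To make the comparison in the comment above concrete, here is a minimal, self-contained sketch. The `Tracked` class and both method names are hypothetical and are not taken from the patch; they only record the order in which resources are closed, so the finally-with-null-check style and the try-with-resources style can be compared side by side.

```java
import java.util.ArrayList;
import java.util.List;

final class TryWithResourcesSketch {

  /** Hypothetical resource (not from the patch) that records when it is closed. */
  static final class Tracked implements AutoCloseable {
    private final String name;
    private final List<String> log;

    Tracked(String name, List<String> log) {
      this.name = name;
      this.log = log;
    }

    @Override
    public void close() {
      log.add(name);
    }
  }

  /** Pre-JDK 7 style: an explicit finally block with null checks. */
  static List<String> closeWithFinally() {
    List<String> log = new ArrayList<>();
    Tracked is = null;
    Tracked fsIn = null;
    try {
      is = new Tracked("is", log);
      fsIn = new Tracked("fsIn", log);
      // processing something here
    } finally {
      if (fsIn != null) {
        fsIn.close();
      }
      if (is != null) {
        is.close();
      }
    }
    return log;
  }

  /** JDK 7 style: resources are closed automatically, in reverse declaration order. */
  static List<String> closeWithResources() {
    List<String> log = new ArrayList<>();
    try (Tracked is = new Tracked("is", log);
         Tracked fsIn = new Tracked("fsIn", log)) {
      // processing something here
    }
    return log;
  }

  public static void main(String[] args) {
    System.out.println(closeWithFinally());   // prints [fsIn, is]
    System.out.println(closeWithResources()); // prints [fsIn, is]
  }
}
```

Both variants close fsIn before is, but the second needs no null checks and keeps working if more resources are added to the header.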
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339774#comment-14339774 ]

Hadoop QA commented on YARN-2820:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12701267/YARN-2820.006.patch
against trunk revision 8ca0d95.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6775//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6775//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6775//console

This message is automatically generated.

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339716#comment-14339716 ]

zhihai xu commented on YARN-2820:
---------------------------------

[~ozawa], thanks for your thorough review; I really appreciate it.
I uploaded a new patch, YARN-2820.005.patch, which addresses all your comments. It also closes fsIn with try-with-resources in loadRMDTSecretManagerState, similar to how fsOut is closed in storeRMDTMasterKeyState.
Please review it.
Thanks,
zhihai

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339703#comment-14339703 ] Tsuyoshi Ozawa commented on YARN-2820: -- Good catch! Yes, we should retry there also. > Do retry in FileSystemRMStateStore for better error recovery when > update/store failure due to IOException. > -- > > Key: YARN-2820 > URL: https://issues.apache.org/jira/browse/YARN-2820 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.5.0, 2.6.0 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-2820.000.patch, YARN-2820.001.patch, > YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, > YARN-2820.005.patch > > > Do retry in FileSystemRMStateStore for better error recovery when > update/store failure due to IOException. > When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We > saw the following IOexception cause the RM shutdown. > {code} > 2014-10-29 23:49:12,202 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: > Updating info for attempt: appattempt_1409135750325_109118_01 at: > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01 > 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... > 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... 
> 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... > 2014-10-29 23:49:46,283 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: > Error updating info for attempt: appattempt_1409135750325_109118_01 > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. > 2014-10-29 23:49:46,284 ERROR > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: > Error storing/updating appAttempt: appattempt_1409135750325_109118_01 > 2014-10-29 23:49:46,916 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: > Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. Cause: > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. 
> at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) > > at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) > > at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) > {code} > As discussed at YARN-1778, TestFSRMStateStore failure is also due to > IOException in storeApplicationStateInternal. > Stack trace from TestFSRMStateStore failure: > {code} > 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore > (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception > org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still > not started >at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.c
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339592#comment-14339592 ]

zhihai xu commented on YARN-2820:
---------------------------------

That is a good finding. I double-checked all the FS operations in FileSystemRMStateStore. In addition to the cases you found, there is one more missing case, in closeInternal:
{code}
fs.close();
{code}
I will upload a new patch shortly to include retries for all these missing cases.

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338589#comment-14338589 ]

Tsuyoshi Ozawa commented on YARN-2820:
--------------------------------------

[~zxu] thanks for the update! The implementation of FSAction looks good to me. I found the following points to be fixed:
1. In startInternal, fs.mkdirs can be replaced with mkdirsWithRetries:
{code}
fs.mkdirs(rmDTSecretManagerRoot);
fs.mkdirs(rmAppRoot);
fs.mkdirs(amrmTokenSecretManagerRoot);
{code}
2. All readFile() calls should be replaced with readFileWithRetries, in the same way as writeFileWithRetries.
3. fs.listStatus() should be replaced with listStatusWithRetries.
4. We can use try-with-resources in storeRMDTMasterKeyState to close fsOut. I know it's not related to this patch, but it would be better to fix it here.
{code}
DataOutputStream fsOut = new DataOutputStream(os);
{code}
Would you mind updating the patch again?

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
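The review points above all amount to routing each file-system call through one retry wrapper. The following is only a sketch of that idea under stated assumptions: `FsAction`, `runWithRetries(maxRetries, retryIntervalMs)` and `callsUntilSuccess` are hypothetical names chosen for illustration, and the FSAction class in the actual patch may have a different shape and take its retry settings from configuration.

```java
import java.io.IOException;

final class FsRetrySketch {

  /** Hypothetical FSAction-style wrapper; names and signatures are assumptions. */
  abstract static class FsAction<T> {
    abstract T run() throws IOException;

    /** Re-runs run() on IOException, up to maxRetries extra attempts. */
    T runWithRetries(int maxRetries, long retryIntervalMs) throws IOException {
      int retry = 0;
      while (true) {
        try {
          return run();
        } catch (IOException e) {
          if (++retry > maxRetries) {
            throw e; // retries exhausted: surface the last failure
          }
          try {
            Thread.sleep(retryIntervalMs);
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting to retry", ie);
          }
        }
      }
    }
  }

  /**
   * Simulates an operation (think of a mkdirs call) that fails `failures`
   * times before succeeding. Returns the number of calls made, or -1 if
   * the retry budget ran out first.
   */
  static int callsUntilSuccess(final int failures, int maxRetries) {
    final int[] calls = {0};
    FsAction<Boolean> flakyMkdirs = new FsAction<Boolean>() {
      @Override
      Boolean run() throws IOException {
        if (++calls[0] <= failures) {
          throw new IOException("simulated transient failure");
        }
        return Boolean.TRUE;
      }
    };
    try {
      flakyMkdirs.runWithRetries(maxRetries, 0L);
      return calls[0];
    } catch (IOException e) {
      return -1;
    }
  }

  public static void main(String[] args) {
    System.out.println(callsUntilSuccess(2, 3)); // prints 3
    System.out.println(callsUntilSuccess(5, 3)); // prints -1
  }
}
```

With a wrapper of this kind, helpers such as mkdirsWithRetries or listStatusWithRetries can each be a thin method that builds an action around the single underlying call, so the retry policy lives in one place.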
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338170#comment-14338170 ]

Hadoop QA commented on YARN-2820:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12700999/YARN-2820.005.patch
against trunk revision 71385f9.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 6 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6753//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6753//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6753//console

This message is automatically generated.

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338083#comment-14338083 ]

zhihai xu commented on YARN-2820:
---------------------------------

[~ozawa], Thanks for the review. Both are very good suggestions.
I uploaded a new patch YARN-2820.005.patch, which addressed both comments.
Please review it.

thanks
zhihai

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> ----------------------------------------------------------------------
>
> Key: YARN-2820
> URL: https://issues.apache.org/jira/browse/YARN-2820
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.5.0, 2.6.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch, YARN-2820.005.patch
>
>
> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown.
> {code}
> 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01
> 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
> 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01
> 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
>         at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
>         at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
>         at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> As discussed at YARN-1778, TestFSRMStateStore failure is also due to IOException in storeApplicationStateInternal.
> Stack trace from TestFSRMStateStore failure:
> {code}
> 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.i
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335005#comment-14335005 ]

Tsuyoshi OZAWA commented on YARN-2820:
--------------------------------------

[~zxu] Great job! We are almost there. To avoid repeating code for retry, I think it's better to have FSAction like ZKAction in ZKRMStateStore. What do you think?

Minor nits: I prefer to have a line break after "=" for readability.
{code}
+  public static final String FS_RM_STATE_STORE_NUM_RETRIES = RM_PREFIX
+      + "fs.state-store.num-retries";
+  public static final String FS_RM_STATE_STORE_RETRY_INTERVAL_MS = RM_PREFIX
+      + "fs.state-store.retry-interval-ms";
{code}
{code}
  public static final String FS_RM_STATE_STORE_NUM_RETRIES =
      RM_PREFIX + "fs.state-store.num-retries";
  public static final String FS_RM_STATE_STORE_RETRY_INTERVAL_MS =
      RM_PREFIX + "fs.state-store.retry-interval-ms";
{code}

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> ----------------------------------------------------------------------
>
> Key: YARN-2820
> URL: https://issues.apache.org/jira/browse/YARN-2820
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.5.0, 2.6.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch
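The FSAction idea raised in the review comment above (one shared wrapper instead of a retry loop in every method) could look roughly like the sketch below. It is only an illustration under stated assumptions: the names FsRetrySketch, FsOp and runWithRetries are hypothetical and are not taken from the YARN-2820 patch, and the two numeric parameters merely stand in for the values that FS_RM_STATE_STORE_NUM_RETRIES and FS_RM_STATE_STORE_RETRY_INTERVAL_MS would configure.

```java
import java.io.IOException;

/**
 * Hypothetical sketch of an FSAction-style retry helper. All names are
 * illustrative assumptions, not the API introduced by the actual patch.
 */
final class FsRetrySketch {

  /** One file-system operation that may fail with an IOException. */
  interface FsOp<T> {
    T run() throws IOException;
  }

  /**
   * Runs op, retrying up to numRetries additional times on IOException and
   * sleeping retryIntervalMs between attempts. Rethrows the last failure
   * once the retries are exhausted.
   */
  static <T> T runWithRetries(FsOp<T> op, int numRetries, long retryIntervalMs)
      throws IOException, InterruptedException {
    int retry = 0;
    while (true) {
      try {
        return op.run();
      } catch (IOException e) {
        if (++retry > numRetries) {
          throw e; // retries exhausted: surface the original error
        }
        Thread.sleep(retryIntervalMs);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated store operation that fails twice before succeeding.
    String result = runWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new IOException("simulated transient failure");
      }
      return "stored";
    }, 3, 10L);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

With a wrapper of this shape, each store or update call would pass its file-system operation as a lambda, so the retry policy lives in one place. Whether non-retryable errors should be filtered out before retrying is a design question this sketch leaves open.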
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334748#comment-14334748 ]

Hadoop QA commented on YARN-2820:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12700337/YARN-2820.004.patch
  against trunk revision b610c68.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6709//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6709//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6709//console

This message is automatically generated.

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> ----------------------------------------------------------------------
>
> Key: YARN-2820
> URL: https://issues.apache.org/jira/browse/YARN-2820
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.5.0, 2.6.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334381#comment-14334381 ]

zhihai xu commented on YARN-2820:
---------------------------------

[~ozawa], Sorry for the delay in updating the patch. Your review was really thorough. Thanks for that.
I uploaded a new patch YARN-2820.004.patch, which addresses all your comments.
Please review it.

thanks
zhihai

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> ----------------------------------------------------------------------
>
> Key: YARN-2820
> URL: https://issues.apache.org/jira/browse/YARN-2820
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.5.0, 2.6.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch, YARN-2820.004.patch
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328862#comment-14328862 ]

Tsuyoshi OZAWA commented on YARN-2820:
--------------------------------------

[~zxu] Thank you for updating the patch.

1. Should we create "*WithRetries" methods for deleteFile/renameFile/createFile/getFileStatus too? Note that we should update "replaceFile" to use renameFileWithRetries instead of calling fs.rename(srcPath, dstPath) directly:
{code}
  protected void replaceFile(Path srcPath, Path dstPath) throws Exception {
    if (fs.exists(dstPath)) {
      deleteFile(dstPath);
    } else {
      LOG.info("File doesn't exist. Skip deleting the file " + dstPath);
    }
    fs.rename(srcPath, dstPath);
  }
{code}
2. Should we create existsWithRetries and use it instead of fs.exists()?
3. Please move *WithRetries methods below the following comment:
{code}
  // FileSystem related code
{code}

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> ----------------------------------------------------------------------
>
> Key: YARN-2820
> URL: https://issues.apache.org/jira/browse/YARN-2820
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.5.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch
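The review points above ask for replaceFile to go through retried exists/delete/rename calls rather than calling the file system directly. A minimal sketch of that shape follows. It makes several assumptions: SimpleFs is a hypothetical stand-in for the handful of FileSystem calls involved (it is not the Hadoop FileSystem class), withRetries and FlakyFs are illustrative names, and paths are plain strings so that the example compiles without Hadoop on the classpath.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: every name here is hypothetical, not from the patch. */
final class ReplaceFileSketch {

  /** Minimal stand-in for the file-system calls that replaceFile relies on. */
  interface SimpleFs {
    boolean exists(String path) throws IOException;
    boolean delete(String path) throws IOException;
    boolean rename(String src, String dst) throws IOException;
  }

  /** A single file-system call that may throw IOException. */
  interface FsCall<T> {
    T run() throws IOException;
  }

  /** Retries the call up to numRetries extra times, sleeping between attempts. */
  static <T> T withRetries(FsCall<T> call, int numRetries, long intervalMs)
      throws IOException, InterruptedException {
    for (int attempt = 0; ; attempt++) {
      try {
        return call.run();
      } catch (IOException e) {
        if (attempt >= numRetries) {
          throw e; // no retries left
        }
        Thread.sleep(intervalMs);
      }
    }
  }

  /** Deletes dst if present, then renames src to dst; each step is retried. */
  static void replaceFile(SimpleFs fs, String src, String dst,
      int numRetries, long intervalMs) throws IOException, InterruptedException {
    boolean dstExists = withRetries(() -> fs.exists(dst), numRetries, intervalMs);
    if (dstExists) {
      withRetries(() -> fs.delete(dst), numRetries, intervalMs);
    }
    boolean renamed = withRetries(() -> fs.rename(src, dst), numRetries, intervalMs);
    if (!renamed) {
      throw new IOException("rename " + src + " -> " + dst + " returned false");
    }
  }

  /** In-memory SimpleFs whose rename fails a fixed number of times first. */
  static final class FlakyFs implements SimpleFs {
    final Set<String> paths = new HashSet<>();
    int renameFailuresLeft;

    FlakyFs(int renameFailures) {
      this.renameFailuresLeft = renameFailures;
    }

    @Override public boolean exists(String path) {
      return paths.contains(path);
    }

    @Override public boolean delete(String path) {
      return paths.remove(path);
    }

    @Override public boolean rename(String src, String dst) throws IOException {
      if (renameFailuresLeft > 0) {
        renameFailuresLeft--;
        throw new IOException("simulated transient rename failure");
      }
      if (!paths.remove(src)) {
        return false; // source missing
      }
      paths.add(dst);
      return true;
    }
  }

  public static void main(String[] args) throws Exception {
    FlakyFs fs = new FlakyFs(2);
    fs.paths.add("state.new");
    fs.paths.add("state");
    replaceFile(fs, "state.new", "state", 3, 10L);
    System.out.println(fs.paths);
  }
}
```

Note that the delete-then-rename sequence is not atomic: if the rename still fails after all retries, the destination has already been removed. The sketch does not try to solve that; it only shows each call being routed through the retry wrapper.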
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328593#comment-14328593 ]

Tsuyoshi OZAWA commented on YARN-2820:
--------------------------------------

I'll take a look.

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> ----------------------------------------------------------------------
>
> Key: YARN-2820
> URL: https://issues.apache.org/jira/browse/YARN-2820
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.5.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch
>
>
> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We saw the following IOexception cause the RM shutdown.
> {code}
> 2014-10-29 23:49:12,202 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Updating info for attempt: appattempt_1409135750325_109118_01 at: /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01
> 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/appattempt_1409135750325_109118_01.new.tmp retrying...
> 2014-10-29 23:49:46,283 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Error updating info for attempt: appattempt_1409135750325_109118_01
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
> 2014-10-29 23:49:46,284 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing/updating appAttempt: appattempt_1409135750325_109118_01
> 2014-10-29 23:49:46,916 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
>         at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
>         at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
>         at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> As discussed at YARN-1778, TestFSRMStateStore failure is also due to IOException in storeApplicationStateInternal.
> Stack trace from TestFSRMStateStore failure:
> {code}
> 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not started
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
>         at org.apache.hadoop.hdf
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328576#comment-14328576 ]

zhihai xu commented on YARN-2820:
---------------------------------

None of these 5 Findbugs warnings are related to my change.

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> ----------------------------------------------------------------------
>
> Key: YARN-2820
> URL: https://issues.apache.org/jira/browse/YARN-2820
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.5.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325676#comment-14325676 ]

Hadoop QA commented on YARN-2820:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12699443/YARN-2820.003.patch
  against trunk revision b6fc1f3.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6658//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6658//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6658//console

This message is automatically generated.

> Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
> ----------------------------------------------------------------------
>
> Key: YARN-2820
> URL: https://issues.apache.org/jira/browse/YARN-2820
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.5.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2820.000.patch, YARN-2820.001.patch, YARN-2820.002.patch, YARN-2820.003.patch
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325621#comment-14325621 ]

zhihai xu commented on YARN-2820:
---------------------------------

I checked the warning messages; none of these 5 Findbugs warnings is related to my change.

> Do retry in FileSystemRMStateStore for better error recovery when > update/store failure due to IOException. > -- > > Key: YARN-2820 > URL: https://issues.apache.org/jira/browse/YARN-2820 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-2820.000.patch, YARN-2820.001.patch, > YARN-2820.002.patch, YARN-2820.003.patch > > > Do retry in FileSystemRMStateStore for better error recovery when > update/store failure due to IOException. > When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We > saw the following IOexception cause the RM shutdown. > {code} > 2014-10-29 23:49:12,202 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: > Updating info for attempt: appattempt_1409135750325_109118_01 at: > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01 > 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... > 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... 
> 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... > 2014-10-29 23:49:46,283 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: > Error updating info for attempt: appattempt_1409135750325_109118_01 > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. > 2014-10-29 23:49:46,284 ERROR > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: > Error storing/updating appAttempt: appattempt_1409135750325_109118_01 > 2014-10-29 23:49:46,916 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: > Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. Cause: > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. 
> at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) > > at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) > > at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) > {code} > As discussed at YARN-1778, TestFSRMStateStore failure is also due to > IOException in storeApplicationStateInternal. > Stack trace from TestFSRMStateStore failure: > {code} > 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore > (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception > org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still > not started >at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRp
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325616#comment-14325616 ]

Hadoop QA commented on YARN-2820:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12699439/YARN-2820.002.patch
  against trunk revision b6fc1f3.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in
      hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api
      hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common
      hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6657//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6657//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6657//console

This message is automatically generated.

> Do retry in FileSystemRMStateStore for better error recovery when > update/store failure due to IOException.
> -- > > Key: YARN-2820 > URL: https://issues.apache.org/jira/browse/YARN-2820 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-2820.000.patch, YARN-2820.001.patch, > YARN-2820.002.patch, YARN-2820.003.patch > > > Do retry in FileSystemRMStateStore for better error recovery when > update/store failure due to IOException. > When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We > saw the following IOexception cause the RM shutdown. > {code} > 2014-10-29 23:49:12,202 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: > Updating info for attempt: appattempt_1409135750325_109118_01 at: > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01 > 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... > 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... > 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... > 2014-10-29 23:49:46,283 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: > Error updating info for attempt: appattempt_1409135750325_109118_01 > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. 
> 2014-10-29 23:49:46,284 ERROR > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: > Error storing/updating appAttempt: appattempt_1409135750325_109118_01 > 2014-10-29 23:49:46,916 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: > Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. Cause: > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. > at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) > > at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) > > at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSyste
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325589#comment-14325589 ]

zhihai xu commented on YARN-2820:
---------------------------------

[~ozawa], thanks for the review. Your suggestion is good. I uploaded a new patch, YARN-2820.003.patch, which addresses your comment. Please review it.

Thanks,
zhihai

> Do retry in FileSystemRMStateStore for better error recovery when > update/store failure due to IOException. > -- > > Key: YARN-2820 > URL: https://issues.apache.org/jira/browse/YARN-2820 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-2820.000.patch, YARN-2820.001.patch, > YARN-2820.002.patch, YARN-2820.003.patch > > > Do retry in FileSystemRMStateStore for better error recovery when > update/store failure due to IOException. > When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We > saw the following IOexception cause the RM shutdown. > {code} > 2014-10-29 23:49:12,202 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: > Updating info for attempt: appattempt_1409135750325_109118_01 at: > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01 > 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... > 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... 
> 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... > 2014-10-29 23:49:46,283 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: > Error updating info for attempt: appattempt_1409135750325_109118_01 > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. > 2014-10-29 23:49:46,284 ERROR > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: > Error storing/updating appAttempt: appattempt_1409135750325_109118_01 > 2014-10-29 23:49:46,916 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: > Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. Cause: > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. 
> at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) > > at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) > > at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) > {code} > As discussed at YARN-1778, TestFSRMStateStore failure is also due to > IOException in storeApplicationStateInternal. > Stack trace from TestFSRMStateStore failure: > {code} > 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore > (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception > org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still > not started >at > o
[jira] [Commented] (YARN-2820) Do retry in FileSystemRMStateStore for better error recovery when update/store failure due to IOException.
[ https://issues.apache.org/jira/browse/YARN-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325561#comment-14325561 ]

Tsuyoshi OZAWA commented on YARN-2820:
--------------------------------------

[~zxu], thank you for the update! The patch looks good to me overall. One minor nit: to fix the test failure we faced on YARN-1778, how about making the value of YarnConfiguration.FS_RM_STATE_STORE_NUM_RETRIES larger in TestFSRMStateStore?

> Do retry in FileSystemRMStateStore for better error recovery when > update/store failure due to IOException. > -- > > Key: YARN-2820 > URL: https://issues.apache.org/jira/browse/YARN-2820 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-2820.000.patch, YARN-2820.001.patch, > YARN-2820.002.patch > > > Do retry in FileSystemRMStateStore for better error recovery when > update/store failure due to IOException. > When we use FileSystemRMStateStore as yarn.resourcemanager.store.class, We > saw the following IOexception cause the RM shutdown. > {code} > 2014-10-29 23:49:12,202 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: > Updating info for attempt: appattempt_1409135750325_109118_01 at: > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01 > 2014-10-29 23:49:19,495 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... > 2014-10-29 23:49:23,757 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... 
> 2014-10-29 23:49:31,120 INFO org.apache.hadoop.hdfs.DFSClient: Could not > complete > /tmp/hadoop-yarn/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1409135750325_109118/ > appattempt_1409135750325_109118_01.new.tmp retrying... > 2014-10-29 23:49:46,283 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: > Error updating info for attempt: appattempt_1409135750325_109118_01 > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. > 2014-10-29 23:49:46,284 ERROR > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: > Error storing/updating appAttempt: appattempt_1409135750325_109118_01 > 2014-10-29 23:49:46,916 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: > Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. Cause: > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. 
> at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132) > > at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) > > at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) > > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) > {code} > As discussed at YARN-1778, TestFSRMStateStore failure is also due to > IOException in storeApplicationStateInternal. > Stack trace from TestFSRMStateStore failure: > {code} > 2015-02-03 00:09:19,092 INFO [Thread-110] recovery.TestFSRMStateStore > (TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception > org.apache.hadoop.ipc.RemoteExceptio
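
The retry behavior this issue asks for, and the effect of the larger retry count suggested for TestFSRMStateStore, can be sketched roughly as follows. This is a minimal illustration, not the code from any YARN-2820 patch: `RetrySketch`, `IoAction`, and `runWithRetries` are hypothetical names, and the `numRetries` and `retryIntervalMs` parameters only loosely stand in for settings such as YarnConfiguration.FS_RM_STATE_STORE_NUM_RETRIES.

```java
import java.io.IOException;

// Hypothetical sketch: retry a store/update operation on IOException
// instead of failing (and shutting down) on the first error.
public class RetrySketch {

    /** A store/update operation that may fail with an IOException. */
    @FunctionalInterface
    interface IoAction {
        void run() throws IOException;
    }

    /**
     * Runs the action, retrying up to numRetries times after an IOException
     * and sleeping retryIntervalMs between attempts.
     *
     * @return the total number of attempts made, including the successful one
     * @throws IOException the last failure, once all retries are exhausted
     */
    static int runWithRetries(IoAction action, int numRetries, long retryIntervalMs)
            throws IOException, InterruptedException {
        int retriesUsed = 0;
        while (true) {
            try {
                action.run();
                return retriesUsed + 1;
            } catch (IOException e) {
                if (retriesUsed >= numRetries) {
                    throw e; // give up; the caller decides whether this is fatal
                }
                retriesUsed++;
                Thread.sleep(retryIntervalMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated transient failure: the first two calls fail, the third succeeds.
        int attempts = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("NameNode still not started");
            }
        }, 5, 10L);
        System.out.println("attempts used: " + attempts);
    }
}
```

With a retry budget that is too small for the transient failure (for example, a NameNode that is still starting), the last IOException still propagates, which is why a larger retry count in the test makes it less sensitive to slow startup.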