zhihai xu created YARN-2820:
-------------------------------

             Summary: Improve FileSystemRMStateStore update failure exception 
handling to not shut down RM.
                 Key: YARN-2820
                 URL: https://issues.apache.org/jira/browse/YARN-2820
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
    Affects Versions: 2.5.0
            Reporter: zhihai xu
            Assignee: zhihai xu


When we used FileSystemRMStateStore as yarn.resourcemanager.store.class, we saw 
the following IOException cause the RM to shut down.

{code}
FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
    at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
    at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:522)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateFile(FileSystemRMStateStore.java:534)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationAttemptStateInternal(FileSystemRMStateStore.java:389)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:744)
{code}

It would be better to improve FileSystemRMStateStore's handling of update 
failures so that it does not shut down the RM. That way a single failed state 
write cannot stop all jobs.
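
One possible direction, shown only as a minimal sketch and not as the actual 
patch: retry the failing write a bounded number of times before treating the 
failure as fatal. The names RetrySketch, StoreOp, runWithRetries, maxRetries 
and retryIntervalMs are hypothetical and do not come from the Hadoop code base; 
StoreOp stands in for a state-store write such as the writeFile/updateFile 
calls in the trace above.

```java
import java.io.IOException;

public class RetrySketch {

    /** Hypothetical stand-in for a state-store write that may fail transiently. */
    @FunctionalInterface
    public interface StoreOp {
        void run() throws IOException;
    }

    /**
     * Runs op, retrying on IOException up to maxRetries additional times and
     * sleeping retryIntervalMs between attempts. Returns the number of attempts
     * made. The last IOException is rethrown once the retries are exhausted, so
     * the caller still decides whether the failure is fatal.
     */
    public static int runWithRetries(StoreOp op, int maxRetries, long retryIntervalMs)
            throws IOException, InterruptedException {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                op.run();
                return attempt;
            } catch (IOException e) {
                if (attempt > maxRetries) {
                    throw e; // retries exhausted: surface the original failure
                }
                Thread.sleep(retryIntervalMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated write that fails twice before succeeding.
        int attempts = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Unable to close file (simulated)");
            }
        }, 5, 10L);
        System.out.println("succeeded after " + attempts + " attempts");
    }
}
```

With a wrapper like this, a transient HDFS condition such as the replica 
shortage above would surface as STATE_STORE_OP_FAILED only if it persisted 
across all attempts.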



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
