Junping Du created YARN-3641:
--------------------------------

             Summary: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
                 Key: YARN-3641
                 URL: https://issues.apache.org/jira/browse/YARN-3641
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Junping Du
            Assignee: Junping Du
            Priority: Critical

If the NM's services are not stopped properly, the NM cannot be started again with work-preserving NM restart enabled. The exception is as follows:

{noformat}
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
	at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	... 5 more
2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
************************************************************/
{noformat}

The related code in NodeManager.java is below:

{code}
  @Override
  protected void serviceStop() throws Exception {
    if (isStopping.getAndSet(true)) {
      return;
    }
    super.serviceStop();
    stopRecoveryStore();
    DefaultMetricsSystem.shutdown();
  }
{code}

We can see that all of the NM's registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If any of these services throws an exception while stopping, stopRecoveryStore() is skipped, which means the levelDB store is not closed and its LOCK file stays held. The next time the NM starts, it fails with the exception above. We should put stopRecoveryStore(); in a finally block.
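
To illustrate the ordering problem, here is a minimal, self-contained sketch of the try/finally approach. It is not the actual patch and does not use the Hadoop classes: StopOrderSketch, RecoveryStore, stopSubServices() and storeClosedAfterFailedStop() are hypothetical stand-ins for NodeManager, the levelDB state store, super.serviceStop() and a small check. The metrics shutdown is left out for brevity.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class StopOrderSketch {

  // Hypothetical stand-in for the levelDB-backed recovery store.
  static class RecoveryStore {
    private boolean closed = false;
    void close() { closed = true; }
    boolean isClosed() { return closed; }
  }

  private final AtomicBoolean isStopping = new AtomicBoolean(false);
  final RecoveryStore store = new RecoveryStore();
  private final boolean subServiceFails;

  StopOrderSketch(boolean subServiceFails) {
    this.subServiceFails = subServiceFails;
  }

  // Hypothetical stand-in for super.serviceStop(): may throw while stopping.
  private void stopSubServices() throws Exception {
    if (subServiceFails) {
      throw new IllegalStateException("sub-service failed to stop");
    }
  }

  void serviceStop() throws Exception {
    if (isStopping.getAndSet(true)) {
      return;
    }
    try {
      stopSubServices();
    } finally {
      // Runs whether or not stopSubServices() threw, so the store is
      // always closed; the exception still propagates to the caller.
      store.close();
    }
  }

  // Stops a sketch whose sub-service fails and reports whether the
  // store was closed anyway.
  static boolean storeClosedAfterFailedStop() {
    StopOrderSketch nm = new StopOrderSketch(true);
    try {
      nm.serviceStop();
    } catch (Exception e) {
      // Expected: the sub-service failure is rethrown after the finally block.
    }
    return nm.store.isClosed();
  }

  public static void main(String[] args) {
    System.out.println("store closed after failed stop: "
        + storeClosedAfterFailedStop());
  }
}
```

In this sketch the sub-service exception is still thrown to the caller, so a failed stop remains visible; only the cleanup order changes.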