[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876006#comment-16876006 ]
kyungwan nam commented on YARN-9521:
------------------------------------

After some further digging, I think I have figured out the cause of this issue more precisely.

Normally, when a yarn-service API call is made, a new UGI is created and the work is performed inside ugi.doAs(). When FileSystem.get() is called inside ugi.doAs(), it always creates a new FileSystem, because the UGI is part of the key into FileSystem.CACHE (YARN-3336 is helpful for understanding this). So in this case, sc.close() does not close a FileSystem that anything else obtained from FileSystem.CACHE.

{code}
UserGroupInformation ugi = getProxyUser(request);
LOG.info("POST: createService = {} user = {}", service, ugi);
if (service.getState() == ServiceState.STOPPED) {
  ugi.doAs(new PrivilegedExceptionAction<Void>() {
    @Override
    public Void run() throws YarnException, IOException {
      ServiceClient sc = getServiceClient();
      try {
        sc.init(YARN_CONFIG);
        sc.start();
        sc.actionBuild(service);
      } finally {
        sc.close();
      }
      return null;
    }
  });
{code}

On the other hand, ApiServiceClient.actionCleanUp, which is called from RMAppImpl.appAdminClientCleanUp, runs as the RM loginUser rather than inside doAs(). In this case, FileSystem.get() can return the cached instance that SystemServiceManagerImpl and FileSystemNodeLabelsStore also refer to, and sc.close() closes it out from under them.

{code}
@Override
public int actionCleanUp(String appName, String userName)
    throws IOException, YarnException {
  ServiceClient sc = new ServiceClient();
  sc.init(getConfig());
  sc.start();
  int result = sc.actionCleanUp(appName, userName);
  sc.close();
  return result;
}
{code}

> RM failed to start due to system services
> -----------------------------------------
>
>                 Key: YARN-9521
>                 URL: https://issues.apache.org/jira/browse/YARN-9521
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: kyungwan nam
>            Priority: Major
>        Attachments: YARN-9521.001.patch, YARN-9521.002.patch
>
>
> When starting the RM, listing the system services directory fails as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory is configured to /services
> 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO service.AbstractService (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: Filesystem closed
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
>     at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
>     at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
>     at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1217)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1233)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1200)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
>     at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
>     at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
>     at org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>     ... 13 more
> {code}
> It looks like this is due to the use of the FileSystem cache.
> This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to yarn-site.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
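The shared-cache failure mode described in the comment can be sketched without Hadoop. The following is a minimal, self-contained Java sketch, not Hadoop code: FakeFs, the user-name keys, and the simplified map are illustrative stand-ins for FileSystem, the (scheme, authority, UGI) cache key, and FileSystem.CACHE. The point it demonstrates is that two callers resolving to the same cache key share one instance, so close() by one caller leaves the other holding a closed handle, while a caller under a fresh proxy UGI maps to a different key and gets its own instance.

```java
import java.util.HashMap;
import java.util.Map;

public class FsCacheSketch {

    // Stand-in for a cached FileSystem: close() flips a shared flag.
    static class FakeFs {
        private boolean closed = false;
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    // Stand-in for FileSystem.CACHE, keyed here only by a user name
    // (the UGI part of the real key; scheme/authority omitted).
    static final Map<String, FakeFs> CACHE = new HashMap<>();

    static FakeFs get(String user) {
        return CACHE.computeIfAbsent(user, u -> new FakeFs());
    }

    public static void main(String[] args) {
        // SystemServiceManagerImpl obtains its FS as the RM login user.
        FakeFs systemServiceFs = get("rm-login-user");

        // actionCleanUp also runs as the login user (no doAs), so it
        // gets the SAME cached instance and closes it when done.
        FakeFs cleanupFs = get("rm-login-user");
        cleanupFs.close();

        // The instance SystemServiceManagerImpl still holds is now
        // closed: its next directory listing fails.
        System.out.println("shared instance: " + (systemServiceFs == cleanupFs));
        System.out.println("closed: " + systemServiceFs.isClosed());

        // A request under doAs with a fresh proxy UGI maps to a
        // different cache key, so it gets (and closes) its own instance.
        FakeFs proxyFs = get("proxy-user");
        System.out.println("proxy shares instance: " + (proxyFs == systemServiceFs));
    }
}
```

This also illustrates why setting fs.hdfs.impl.disable.cache=true hides the bug: with the cache disabled, every get() returns a fresh instance and no caller can close another's handle.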