[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501002#comment-14501002 ]
zhihai xu commented on YARN-3491:
---------------------------------

I did more profiling in checkLocalDir, and the result really surprised me. The most time-consuming code is status.getPermission(), not lfs.getFileStatus. status.getPermission() takes 4 or 5 ms, and checkLocalDir calls status.getPermission() three times. That is why checkLocalDir takes 10+ ms.
{code}
  private boolean checkLocalDir(String localDir) {
    Map<Path, FsPermission> pathPermissionMap = getLocalDirsPathPermissionsMap(localDir);
    for (Map.Entry<Path, FsPermission> entry : pathPermissionMap.entrySet()) {
      FileStatus status;
      try {
        status = lfs.getFileStatus(entry.getKey());
      } catch (Exception e) {
        String msg =
            "Could not carry out resource dir checks for " + localDir
                + ", which was marked as good";
        LOG.warn(msg, e);
        throw new YarnRuntimeException(msg, e);
      }
      if (!status.getPermission().equals(entry.getValue())) {
        String msg =
            "Permissions incorrectly set for dir " + entry.getKey()
                + ", should be " + entry.getValue() + ", actual value = "
                + status.getPermission();
        LOG.warn(msg);
        throw new YarnRuntimeException(msg);
      }
    }
    return true;
  }
{code}
Then I went deeper into the source code and found out why status.getPermission takes most of the time: lfs.getFileStatus returns a RawLocalFileSystem#DeprecatedRawLocalFileStatus, whose getPermission is:
{code}
    public FsPermission getPermission() {
      if (!isPermissionLoaded()) {
        loadPermissionInfo();
      }
      return super.getPermission();
    }
{code}
So status.getPermission calls loadPermissionInfo. Based on the following code, loadPermissionInfo is the bottleneck: it runs "ls -ld" to get the permission, which is really slow.
{code}
    /// loads permissions, owner, and group from `ls -ld`
    private void loadPermissionInfo() {
      IOException e = null;
      try {
        String output = FileUtil.execCommand(new File(getPath().toUri()),
            Shell.getGetPermissionCommand());
        StringTokenizer t =
            new StringTokenizer(output, Shell.TOKEN_SEPARATOR_REGEX);
        //expected format
        //-rw------- 1 username groupname ...
        String permission = t.nextToken();
        if (permission.length() > FsPermission.MAX_PERMISSION_LENGTH) {
          //files with ACLs might have a '+'
          permission = permission.substring(0,
            FsPermission.MAX_PERMISSION_LENGTH);
        }
        setPermission(FsPermission.valueOf(permission));
        t.nextToken();
        String owner = t.nextToken();
        // If on windows domain, token format is DOMAIN\\user and we want to
        // extract only the user name
        if (Shell.WINDOWS) {
          int i = owner.indexOf('\\');
          if (i != -1)
            owner = owner.substring(i + 1);
        }
        setOwner(owner);
        setGroup(t.nextToken());
      } catch (Shell.ExitCodeException ioe) {
        if (ioe.getExitCode() != 1) {
          e = ioe;
        } else {
          setPermission(null);
          setOwner(null);
          setGroup(null);
        }
      } catch (IOException ioe) {
        e = ioe;
      } finally {
        if (e != null) {
          throw new RuntimeException("Error while running command to get " +
                                     "file permissions : " +
                                     StringUtils.stringifyException(e));
        }
      }
    }
{code}
We should call getPermission as little as possible in the future :)

> PublicLocalizer#addResource is too slow.
> ----------------------------------------
>
>                 Key: YARN-3491
>                 URL: https://issues.apache.org/jira/browse/YARN-3491
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3491.000.patch, YARN-3491.001.patch
>
>
> Based on the profiling, the bottleneck in PublicLocalizer#addResource is
> getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir.
> checkLocalDir is very slow, taking about 10+ ms.
> The total delay will be approximately number of local dirs * 10+ ms.
> This delay will be added for each public resource localization.
> Because PublicLocalizer#addResource is slow, the thread pool can't be fully
> utilized. Instead of doing public resource localization in
> parallel (multithreading), public resource localization is serialized most of
> the time.
> Also, PublicLocalizer#addResource runs in the Dispatcher thread,
> so the Dispatcher thread will be blocked by PublicLocalizer#addResource for
> a long time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
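
Editorial sketch, not part of the original comment or of any YARN-3491 patch: the comment above attributes the cost to forking "ls -ld" for every getPermission call. As a rough illustration of the alternative, the following standalone Java example reads a directory's permissions in-process through java.nio.file and compares them once. The class name PermissionCheckSketch and its helper methods are hypothetical, and the code assumes a POSIX file system (Files.getPosixFilePermissions throws UnsupportedOperationException elsewhere).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical helper class for illustration only; it is not Hadoop code.
final class PermissionCheckSketch {

    /** Reads the permission string (e.g. "rwxr-xr-x") in-process, without spawning `ls -ld`. */
    static String readPermissions(Path dir) throws IOException {
        Set<PosixFilePermission> perms = Files.getPosixFilePermissions(dir);
        return PosixFilePermissions.toString(perms);
    }

    /** Reads the permissions exactly once and compares them with the expected string. */
    static boolean hasPermissions(Path dir, String expected) throws IOException {
        String actual = readPermissions(dir);
        return actual.equals(expected);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the default temp directory lives on a POSIX file system.
        Path dir = Files.createTempDirectory("perm-sketch");
        Files.setPosixFilePermissions(dir, PosixFilePermissions.fromString("rwxr-xr-x"));

        long start = System.nanoTime();
        boolean ok = hasPermissions(dir, "rwxr-xr-x");
        long micros = (System.nanoTime() - start) / 1_000;

        // Elapsed time varies by machine, so no expected value is shown here.
        System.out.println("match=" + ok + " elapsed=" + micros + "us");
        Files.delete(dir);
    }
}
```

Timing this against a forked "ls -ld" on the same directory is one way to check whether the 4 to 5 ms per call reported above comes from process creation rather than from the file system itself.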