[ 
https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501002#comment-14501002
 ] 

zhihai xu commented on YARN-3491:
---------------------------------

I did more profiling in checkLocalDir. It really surprised me.
The most time-consuming code is status.getPermission() not lfs.getFileStatus.
status.getPermission() will take 4 or 5 ms. checkLocalDir will call 
status.getPermission() three times.
That is why checkLocalDir take 10+ms.
{code}
  private boolean checkLocalDir(String localDir) {

    Map<Path, FsPermission> pathPermissionMap = 
getLocalDirsPathPermissionsMap(localDir);

    for (Map.Entry<Path, FsPermission> entry : pathPermissionMap.entrySet()) {
      FileStatus status;
      try {
        status = lfs.getFileStatus(entry.getKey());
      } catch (Exception e) {
        String msg =
            "Could not carry out resource dir checks for " + localDir
                + ", which was marked as good";
        LOG.warn(msg, e);
        throw new YarnRuntimeException(msg, e);
      }

      if (!status.getPermission().equals(entry.getValue())) {
        String msg =
            "Permissions incorrectly set for dir " + entry.getKey()
                + ", should be " + entry.getValue() + ", actual value = "
                + status.getPermission();
        LOG.warn(msg);
        throw new YarnRuntimeException(msg);
      }
    }
    return true;
  }
{code}

Then I go deeper into the source code I find out why status.getPermission take 
the most of time:
lfs.getFileStatus will return RawLocalFileSystem#DeprecatedRawLocalFileStatus,
{code}
    public FsPermission getPermission() {
      if (!isPermissionLoaded()) {
        loadPermissionInfo();
      }
      return super.getPermission();
    }
{code}

So status.getPermission will call loadPermissionInfo,
Based on the following code, loadPermissionInfo is bottle neck, it will call 
run "ls -ld" to get the permission, which is really slow.
{code}
    /// loads permissions, owner, and group from `ls -ld`
    private void loadPermissionInfo() {
      IOException e = null;
      try {
        String output = FileUtil.execCommand(new File(getPath().toUri()), 
            Shell.getGetPermissionCommand());
        StringTokenizer t =
            new StringTokenizer(output, Shell.TOKEN_SEPARATOR_REGEX);
        //expected format
        //-rw-------    1 username groupname ...
        String permission = t.nextToken();
        if (permission.length() > FsPermission.MAX_PERMISSION_LENGTH) {
          //files with ACLs might have a '+'
          permission = permission.substring(0,
            FsPermission.MAX_PERMISSION_LENGTH);
        }
        setPermission(FsPermission.valueOf(permission));
        t.nextToken();

        String owner = t.nextToken();
        // If on windows domain, token format is DOMAIN\\user and we want to
        // extract only the user name
        if (Shell.WINDOWS) {
          int i = owner.indexOf('\\');
          if (i != -1)
            owner = owner.substring(i + 1);
        }
        setOwner(owner);

        setGroup(t.nextToken());
      } catch (Shell.ExitCodeException ioe) {
        if (ioe.getExitCode() != 1) {
          e = ioe;
        } else {
          setPermission(null);
          setOwner(null);
          setGroup(null);
        }
      } catch (IOException ioe) {
        e = ioe;
      } finally {
        if (e != null) {
          throw new RuntimeException("Error while running command to get " +
                                     "file permissions : " + 
                                     StringUtils.stringifyException(e));
        }
      }
    }
{code}

We should call getPermission as least as possible in the future :)

> PublicLocalizer#addResource is too slow.
> ----------------------------------------
>
>                 Key: YARN-3491
>                 URL: https://issues.apache.org/jira/browse/YARN-3491
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3491.000.patch, YARN-3491.001.patch
>
>
> Based on the profiling, The bottleneck in PublicLocalizer#addResource is 
> getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir.
> checkLocalDir is very slow which takes about 10+ ms.
> The total delay will be approximately number of local dirs * 10+ ms.
> This delay will be added for each public resource localization.
> Because PublicLocalizer#addResource is slow, the thread pool can't be fully 
> utilized. Instead of doing public resource localization in 
> parallel(multithreading), public resource localization is serialized most of 
> the time.
> And also PublicLocalizer#addResource is running in Dispatcher thread, 
> So the Dispatcher thread will be blocked by PublicLocalizer#addResource for 
> long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to