[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504124#comment-14504124 ]
zhihai xu commented on YARN-3491:
---------------------------------

Hi [~wilfreds], thanks for the review.

A directory can go from bad to good at any time, asynchronously to both public and private resource localization. Even without my change, this can still happen right after the local and log dirs are initialized in the current code. Also, private resource localization initializes the local and log dirs per container, not per resource. Our purpose is to reduce the chance of failure.

bq. Looking over the code there is also a lot of unneeded object creation which could be stripped out speeding things up and lowering memory usage.

I profiled PublicLocalizer#addResource. None of the other code took much time; the cost is in checkLocalDir, which calls getPermission three times. getPermission runs the command "ls -ld" to get the permission, which is very slow.

But your comment gave me a good idea for a better solution that saves more time: we can call LocalDirsHandlerService#getLastDisksCheckTime to get the timestamp of the previous disk check. Using this information, we only need to initialize the local and log dirs when the timestamp has changed. With the default interval the timestamp changes only every two minutes, so we won't initialize the local and log dirs more than once in two minutes.

{code}
diskHealthCheckInterval = conf.getLong(
    YarnConfiguration.NM_DISK_HEALTH_CHECK_INTERVAL_MS,
    YarnConfiguration.DEFAULT_NM_DISK_HEALTH_CHECK_INTERVAL_MS);

// Default interval: 120000 ms = 2 minutes
public static final long DEFAULT_NM_DISK_HEALTH_CHECK_INTERVAL_MS = 120000L;
{code}

Hi [~jlowe], do you think my new idea is reasonable? I would greatly appreciate any feedback on it.

> PublicLocalizer#addResource is too slow.
> ----------------------------------------
>
> Key: YARN-3491
> URL: https://issues.apache.org/jira/browse/YARN-3491
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-3491.000.patch, YARN-3491.001.patch
>
>
> Based on profiling, the bottleneck in PublicLocalizer#addResource is
> getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir.
> checkLocalDir is very slow, taking about 10+ ms.
> The total delay is approximately (number of local dirs) * 10+ ms.
> This delay is added to each public resource localization.
> Because PublicLocalizer#addResource is slow, the thread pool can't be fully
> utilized. Instead of running in parallel (multithreaded), public resource
> localization is serialized most of the time.
> Also, PublicLocalizer#addResource runs in the Dispatcher thread,
> so the Dispatcher thread is blocked by PublicLocalizer#addResource for a
> long time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
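
For reference, the timestamp-gating idea from the comment above could be sketched roughly as follows. This is only an illustration, not the patch itself: `DirInitGate` and `initializeIfDisksRechecked` are hypothetical names, the `LongSupplier` stands in for a call to LocalDirsHandlerService#getLastDisksCheckTime, and the `Runnable` stands in for the expensive local and log dir initialization. It also assumes real disk-check timestamps are never negative.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class DirInitGate {
    // Stand-in for LocalDirsHandlerService#getLastDisksCheckTime (assumption).
    private final LongSupplier lastDisksCheckTime;
    // Stand-in for the expensive local/log dir initialization (assumption).
    private final Runnable initializeDirs;
    // Timestamp of the disk check we last initialized for; -1 means "never".
    private long lastInitializedFor = -1L;

    DirInitGate(LongSupplier lastDisksCheckTime, Runnable initializeDirs) {
        this.lastDisksCheckTime = lastDisksCheckTime;
        this.initializeDirs = initializeDirs;
    }

    /**
     * Runs the initialization only when the disk-check timestamp differs
     * from the one seen at the previous initialization.
     *
     * @return true if initialization ran, false if it was skipped
     */
    synchronized boolean initializeIfDisksRechecked() {
        long current = lastDisksCheckTime.getAsLong();
        if (current == lastInitializedFor) {
            return false; // no new disk check since last time, skip the slow path
        }
        initializeDirs.run();
        lastInitializedFor = current;
        return true;
    }
}
```

With a gate like this, repeated calls between two disk checks would skip the initialization entirely, so the slow path would run at most once per disk-health-check interval.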