[ https://issues.apache.org/jira/browse/MAPREDUCE-4907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549251#comment-13549251 ]
Alejandro Abdelnur commented on MAPREDUCE-4907:
-----------------------------------------------

+1

> TrackerDistributedCacheManager issues too many getFileStatus calls
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4907
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4907
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv1, tasktracker
>    Affects Versions: 1.1.1
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>         Attachments: MAPREDUCE-4907.patch, MAPREDUCE-4907-trunk-1.patch,
>                      MAPREDUCE-4907-trunk-1.patch, MAPREDUCE-4907-trunk-1.patch,
>                      MAPREDUCE-4907-trunk.patch
>
>
> TrackerDistributedCacheManager issues a number of redundant getFileStatus
> calls when determining the timestamps and visibilities of files in the
> distributed cache. 300 distributed cache files deep in the directory
> structure can hammer HDFS with a couple thousand requests.
>
> A couple optimizations can reduce this load:
>
> 1. determineTimestamps and determineCacheVisibilities both call getFileStatus
>    on every file. We could cache the results of the former and use them for
>    the latter.
> 2. determineCacheVisibilities needs to check that all ancestor directories of
>    each file have execute permissions for everyone. This currently entails a
>    getFileStatus on each ancestor directory for each file. The results of
>    these getFileStatus calls could be cached as well.
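
For illustration only, the two optimizations in the quoted description could be sketched roughly as below. This is not the code from the attached patches. The class StatusCacheSketch, its nested Status type, and the methods status, timestamp, ancestorsExecutable and parent are all hypothetical names, and the Function passed to the constructor merely stands in for a remote getFileStatus call. Paths are assumed to be absolute and normalized.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: memoize status lookups so each path is fetched once. */
class StatusCacheSketch {
    /** Hypothetical stand-in for the parts of a file status used here. */
    static final class Status {
        final long modificationTime;
        final boolean othersCanExecute;

        Status(long modificationTime, boolean othersCanExecute) {
            this.modificationTime = modificationTime;
            this.othersCanExecute = othersCanExecute;
        }
    }

    // Stand-in for the remote getFileStatus call (an assumption, not Hadoop's API).
    private final Function<String, Status> remote;
    private final Map<String, Status> cache = new HashMap<>();
    int remoteCalls = 0;

    StatusCacheSketch(Function<String, Status> remote) {
        this.remote = remote;
    }

    /** Returns the cached status, going to the remote side only on a miss. */
    Status status(String path) {
        if (!cache.containsKey(path)) {
            remoteCalls++;
            cache.put(path, remote.apply(path));
        }
        return cache.get(path);
    }

    /** Optimization 1: the timestamp lookup fills the cache for later checks. */
    long timestamp(String file) {
        return status(file).modificationTime;
    }

    /** Optimization 2: ancestor directories shared by many files are fetched once. */
    boolean ancestorsExecutable(String file) {
        for (String dir = parent(file); dir != null; dir = parent(dir)) {
            if (!status(dir).othersCanExecute) {
                return false;
            }
        }
        return true;
    }

    /** Parent of an absolute, normalized path; null for the root. */
    static String parent(String path) {
        if (path.equals("/")) {
            return null;
        }
        int i = path.lastIndexOf('/');
        return i == 0 ? "/" : path.substring(0, i);
    }

    public static void main(String[] args) {
        StatusCacheSketch c = new StatusCacheSketch(p -> new Status(1L, true));
        for (int i = 0; i < 300; i++) {
            String f = "/a/b/c/d/e/file" + i + ".jar";
            c.timestamp(f);           // first lookup for the file
            c.status(f);              // visibility pass reuses the cached entry
            c.ancestorsExecutable(f); // six shared directories, fetched once
        }
        System.out.println("remote calls: " + c.remoteCalls);
    }
}
```

In this toy setup, 300 files under /a/b/c/d/e cost 306 stand-in lookups (300 files plus 6 shared directories, counting the root). Without the cache, the same loop would make 8 lookups per file, or 2400 in total. A real implementation would also have to decide how long such a cache may live, which this sketch does not address.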