[ https://issues.apache.org/jira/browse/HADOOP-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208203#comment-13208203 ]
Aaron T. Myers commented on HADOOP-6502:
----------------------------------------

Tiny nit: In "} else {//check already performed on this class name" please put a space between "{" and "//".

Otherwise the patch looks good to me. +1.

> DistributedFileSystem#listStatus is very slow when listing a directory with a size of 1300
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6502
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6502
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 0.20.0
>            Reporter: Hairong Kuang
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: 6502.patch, 6502_v2.patch, hadoop-6502-trunk.txt, hadoop-6502-trunk.txt
>
>
> Listing a directory of around 1300 children takes hundreds of milliseconds.
> The slowdown turns out to be caused by the change made in HADOOP-4187. The
> return value of listStatus is an array of FileStatus. When each element of
> the array is deserialized, ReflectionUtils#newInstance(Class<T>, Configuration)
> is called, which calls setConf, which in turn calls setJobConf. setJobConf
> checks whether JobConf is on the class path by calling
> Configuration#getClassByName. Configuration#getClassByName tries to optimize
> the lookup with a cached map, but because JobConf is not on the class path,
> it never ends up in the cache. Every check therefore falls through to
> Class.forName, which is very expensive. Deserializing an array of 1300
> entries requires calling Class#forName 1300 times!
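
For readers following the description, here is a minimal sketch of one way to avoid the repeated failed lookups: cache the misses as well as the hits. This is not the attached patch. The class name ClassLookupCache, the NOT_FOUND sentinel, and both method names are hypothetical and chosen only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not Hadoop code: a class-name cache that also
// remembers names that failed to load ("negative caching").
final class ClassLookupCache {
    // Private marker type; its Class object is the sentinel stored for misses.
    private static final class NotFound {}
    private static final Class<?> NOT_FOUND = NotFound.class;

    private final Map<String, Class<?>> cache = new ConcurrentHashMap<>();
    // Counts how often we actually fall through to Class.forName.
    private final AtomicInteger loads = new AtomicInteger();

    /** Returns the class for the given name, or null if it cannot be loaded. */
    Class<?> getClassByNameOrNull(String name) {
        Class<?> c = cache.get(name);
        if (c == null) {
            // First time this name is seen: pay for the expensive lookup once.
            loads.incrementAndGet();
            try {
                c = Class.forName(name);
            } catch (ClassNotFoundException e) {
                c = NOT_FOUND; // remember the miss instead of retrying later
            }
            cache.put(name, c);
        }
        return c == NOT_FOUND ? null : c;
    }

    /** Number of Class.forName attempts made so far. */
    int loadAttempts() {
        return loads.get();
    }
}
```

With a cache like this, checking for an absent class 1300 times costs one Class.forName call followed by 1299 map lookups. Two threads racing on the same unseen name may each attempt the load once, which is harmless here because both arrive at the same result.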