[ https://issues.apache.org/jira/browse/HDFS-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019687#comment-13019687 ]
Hairong Kuang commented on HDFS-1658:
-------------------------------------

I want to discuss whether we could pursue option #1. Right now, if a path is a directory, FileStatus.length is actually undefined; it just happens that we chose to put 0 there. I think my proposal only enhances the semantics and, strictly speaking, is not an incompatible change.

> It is unnecessary to call getFileInfo first

The problem is that most applications work with FileStatus. For example, getFileSplits in MapReduce has to get the FileStatus of every file by traversing the input directories, calling getFileInfo and listStatus. If we can check whether a directory is empty by looking at its FileStatus, we can avoid issuing a listStatus call to list its children.

> A less expensive way to figure out directory size
> -------------------------------------------------
>
>                 Key: HDFS-1658
>                 URL: https://issues.apache.org/jira/browse/HDFS-1658
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>
> Currently in order to figure out a directory size, we have to list a
> directory by calling RPC getListing and get the number of its children. This
> is an expensive operation especially when a directory has many children
> because it may require multiple RPCs.
> On the other hand when fetching the status of a path (i.e. calling RPC
> getFileInfo), the length field of FileStatus is set to be 0 if the path is a
> directory.
> I am thinking to change this field (FileStatus#length) to be the directory
> size when the path is a directory. So we can call getFileInfo to get the
> directory size. This call is much less expensive and simpler than getListing.
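
To illustrate the point about skipping listStatus for empty directories, here is a minimal sketch. It does not use the real Hadoop classes: Status, FakeNameNode, and collectFiles are hypothetical stand-ins invented for this example, and the sketch assumes the proposed semantics in which a directory's length field carries its number of children. It only counts simulated RPCs to show where a listing call could be avoided.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DirSizeSketch {
    /** Hypothetical stand-in for FileStatus; for a directory, length is the child count. */
    static final class Status {
        final String path;
        final boolean isDir;
        final long length;

        Status(String path, boolean isDir, long length) {
            this.path = path;
            this.isDir = isDir;
            this.length = length;
        }
    }

    /** Hypothetical in-memory namespace that counts simulated RPCs. */
    static final class FakeNameNode {
        private final Map<String, List<String>> children = new HashMap<>();
        private final Map<String, Long> fileLengths = new HashMap<>();
        int getFileInfoCalls = 0;
        int getListingCalls = 0;

        void addDir(String path) {
            children.put(path, new ArrayList<>());
            linkToParent(path);
        }

        void addFile(String path, long length) {
            fileLengths.put(path, length);
            linkToParent(path);
        }

        // Parents must be added before their children; top-level paths have no parent here.
        private void linkToParent(String path) {
            int slash = path.lastIndexOf('/');
            if (slash > 0) {
                children.get(path.substring(0, slash)).add(path);
            }
        }

        // Builds a status without counting an RPC (used by both calls below).
        private Status statusOf(String path) {
            List<String> kids = children.get(path);
            if (kids != null) {
                return new Status(path, true, kids.size());
            }
            return new Status(path, false, fileLengths.get(path));
        }

        Status getFileInfo(String path) {
            getFileInfoCalls++;
            return statusOf(path);
        }

        List<Status> getListing(String path) {
            getListingCalls++;
            List<Status> result = new ArrayList<>();
            for (String child : children.get(path)) {
                result.add(statusOf(child));
            }
            return result;
        }
    }

    /** Traverses an input directory, roughly the way a split calculation might. */
    static List<String> collectFiles(FakeNameNode nn, String root) {
        List<String> files = new ArrayList<>();
        visit(nn, nn.getFileInfo(root), files);
        return files;
    }

    private static void visit(FakeNameNode nn, Status st, List<String> out) {
        if (!st.isDir) {
            out.add(st.path);
            return;
        }
        if (st.length == 0) {
            return; // empty directory: the status alone tells us, so no listing RPC
        }
        for (Status child : nn.getListing(st.path)) {
            visit(nn, child, out);
        }
    }

    public static void main(String[] args) {
        FakeNameNode nn = new FakeNameNode();
        nn.addDir("/in");
        nn.addFile("/in/a", 10L);
        nn.addDir("/in/empty");
        System.out.println(collectFiles(nn, "/in"));
        System.out.println("getListing calls: " + nn.getListingCalls);
    }
}
```

In this sketch the traversal never lists /in/empty, because its status already reports zero children. With length fixed at 0 for every directory, the same traversal would have to issue a listing call for each directory it meets.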