[jira] Commented: (HDFS-946) NameNode should not return full path name when listing a directory or getting the status of a file
[ https://issues.apache.org/jira/browse/HDFS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853868#action_12853868 ]

Hudson commented on HDFS-946:
-----------------------------

Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #302 (See [http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/302/])

NameNode should not return full path name when listing a directory or getting the status of a file
--------------------------------------------------------------------------------------------------

Key: HDFS-946
URL: https://issues.apache.org/jira/browse/HDFS-946
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Hairong Kuang
Assignee: Hairong Kuang
Fix For: 0.22.0
Attachments: HdfsFileStatus-yahoo20.patch, HDFSFileStatus.patch, HDFSFileStatus1.patch, HdfsFileStatus3.patch, HdfsFileStatus4.patch, HdfsFileStatusProxy-Yahoo20.patch

FSDirectory#getListing(String src) has the following code:

    int i = 0;
    for (INode cur : contents) {
      listing[i] = createFileStatus(srcs+cur.getLocalName(), cur);
      i++;
    }

So listing a directory returns an array of FileStatus, and each FileStatus element carries the full path name. This increases the return message size and adds non-negligible CPU time to the operation. FSDirectory#getFileInfo(String) does not need to return the file name either.

Another optimization is that, in the version of FileStatus that is used in the wire protocol, the field path does not need to be a Path; ideally it could be a String or a byte array. This would avoid unnecessary creation of Path objects at the NameNode and so help reduce the GC problem observed when a large number of getFileInfo or getListing operations hit the NameNode.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
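To make the message-size argument in the issue description concrete, here is a minimal sketch that compares the name bytes carried by a listing when every entry holds its full path against a listing that holds only local names as byte arrays. It is not the code from any of the attached patches: the class ListingSizeSketch, its methods, and the sample directory and file names are all hypothetical, and plain strings stand in for INode and FileStatus.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of HDFS.
final class ListingSizeSketch {

    // "Before": every entry repeats the parent directory, as in srcs + cur.getLocalName().
    static List<String> fullPaths(String parent, List<String> children) {
        String prefix = parent.endsWith("/") ? parent : parent + "/";
        List<String> out = new ArrayList<>(children.size());
        for (String child : children) {
            out.add(prefix + child);
        }
        return out;
    }

    // "After": every entry carries only its local name, already encoded as bytes.
    static List<byte[]> localNames(List<String> children) {
        List<byte[]> out = new ArrayList<>(children.size());
        for (String child : children) {
            out.add(child.getBytes(StandardCharsets.UTF_8));
        }
        return out;
    }

    // Total UTF-8 bytes spent on names when full paths are sent.
    static int fullPathBytes(List<String> paths) {
        int total = 0;
        for (String p : paths) {
            total += p.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    // Total bytes spent on names when only local names are sent.
    static int localNameBytes(List<byte[]> names) {
        int total = 0;
        for (byte[] n : names) {
            total += n.length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up directory contents for the comparison.
        List<String> children = Arrays.asList("part-00000", "part-00001", "_SUCCESS");
        String parent = "/user/alice/output";
        System.out.println("full paths:  " + fullPathBytes(fullPaths(parent, children)) + " name bytes");
        System.out.println("local names: " + localNameBytes(localNames(children)) + " name bytes");
    }
}
```

The saving per entry is the length of the parent prefix, so it grows with both directory depth and directory size; the sketch says nothing about the other FileStatus fields, which are unchanged by this idea.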
[jira] Commented: (HDFS-946) NameNode should not return full path name when listing a directory or getting the status of a file
[ https://issues.apache.org/jira/browse/HDFS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853958#action_12853958 ]

Hudson commented on HDFS-946:
-----------------------------

Integrated in Hadoop-Hdfs-trunk #275 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/275/])
[jira] Commented: (HDFS-946) NameNode should not return full path name when listing a directory or getting the status of a file
[ https://issues.apache.org/jira/browse/HDFS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853659#action_12853659 ]

Hudson commented on HDFS-946:
-----------------------------

Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #146 (See [http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/146/])
[jira] Commented: (HDFS-946) NameNode should not return full path name when listing a directory or getting the status of a file
[ https://issues.apache.org/jira/browse/HDFS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836984#action_12836984 ]

Hudson commented on HDFS-946:
-----------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #197 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/197/])
NameNode should not return full path name when listing a directory or getting the status of a file. Contributed by Hairong Kuang.
[jira] Commented: (HDFS-946) NameNode should not return full path name when listing a directory or getting the status of a file
[ https://issues.apache.org/jira/browse/HDFS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835876#action_12835876 ]

Suresh Srinivas commented on HDFS-946:
--------------------------------------

+1 the patch looks good.
[jira] Commented: (HDFS-946) NameNode should not return full path name when listing a directory or getting the status of a file
[ https://issues.apache.org/jira/browse/HDFS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834396#action_12834396 ]

Suresh Srinivas commented on HDFS-946:
--------------------------------------

1. TestDFSShell.java: not sure why the methods are named starting with caps. Also, is the change to this file needed?
2. FSDirectory.createFileStatus: consider moving the isDirectory check outside. Also, the current code extends beyond 80 columns.
3. HDFSFileStatus:
   * Consider naming it HdfsFileStatus.
   * "final static public" should be "public static final".
   * Since this is for HDFS, the comments in the code about different notions in the FS are not required in the methods getPermission(), getOwner(), getGroup().
   * Some of the method parameters and other variables could be declared final.
4. getFulName(): the code is more readable without the unnecessary else. Same for getFullPath().
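As an illustration of the style points in this review (modifier order, final parameters, and early returns in place of an unnecessary else), here is a minimal sketch. The patch under review is not quoted in this thread, so the class StatusNameSketch, its EMPTY_NAME constant, and the methods fullName and localName are hypothetical stand-ins and should not be read as the actual HdfsFileStatus code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a status object that stores only a local name.
final class StatusNameSketch {

    // Conventional modifier order: public static final.
    public static final byte[] EMPTY_NAME = {};

    private final byte[] name; // UTF-8 bytes of the local name; empty for the queried path itself

    StatusNameSketch(final byte[] name) {
        this.name = (name == null) ? EMPTY_NAME : name;
    }

    String localName() {
        return new String(name, StandardCharsets.UTF_8);
    }

    // Early returns keep each case flat, with no else branches to follow.
    String fullName(final String parent) {
        if (name.length == 0) {
            return parent;
        }
        if (parent.endsWith("/")) {
            return parent + localName();
        }
        return parent + "/" + localName();
    }

    public static void main(String[] args) {
        StatusNameSketch child = new StatusNameSketch("data.txt".getBytes(StandardCharsets.UTF_8));
        System.out.println(child.fullName("/user/alice"));
        System.out.println(new StatusNameSketch(EMPTY_NAME).fullName("/user/alice"));
    }
}
```

An empty local name here means "the status describes the path that was asked about", which is one plausible way to avoid returning a name from a single-file status call; that convention is an assumption of the sketch.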
[jira] Commented: (HDFS-946) NameNode should not return full path name when listing a directory or getting the status of a file
[ https://issues.apache.org/jira/browse/HDFS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832745#action_12832745 ]

Hairong Kuang commented on HDFS-946:
------------------------------------

> If you are proposing that the object sent over the wire be different from FileStatus, please consider the requirement of HDFS-878 too.

This jira tries to reduce the cost of getFileInfo and of listing a directory, whereas HDFS-878 adds cost to these two operations. So I will not implement HDFS-878 in this jira. Since we are having so many problems with getFileInfo and listing a directory, we should be very cautious about adding anything to FileStatus in HDFS unless it is absolutely necessary.

I have conducted some experiments with my patch. I wrote an application that spawns 100 threads, each of which lists a directory of 1300 entries 200 times. I used YourKit to profile the NameNode while the application was running. Without the patch, the NameNode's CPU utilization is 20~26% and the time spent on GC is 3~5%. With the patch, the NameNode's CPU utilization drops to 12~17% and the time spent on GC is mostly 0%, occasionally rising to 1 or 2%.
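The load pattern described in this comment (100 threads, each listing a 1300-entry directory 200 times) could be driven by a harness along the following lines. This is only a sketch: the original test application is not shown in the thread, the class ListingLoadSketch and its runLoad method are hypothetical, and the "listing" in main copies an in-memory array instead of calling a NameNode, so it reproduces the shape of the load and none of the reported numbers.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical load driver; the listing operation is supplied by the caller.
final class ListingLoadSketch {

    // Runs listOp `iterations` times on each of `threads` threads and
    // returns the number of operations that completed.
    static long runLoad(final int threads, final int iterations, final Runnable listOp) {
        final AtomicLong completed = new AtomicLong();
        final ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < iterations; i++) {
                    listOp.run();
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
                pool.shutdownNow();
                throw new IllegalStateException("load did not finish in time");
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for load", e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // In-memory stand-in for a directory with 1300 entries.
        final String[] directory = new String[1300];
        for (int i = 0; i < directory.length; i++) {
            directory[i] = "file-" + i;
        }
        // Stand-in for a real listing call: copy the entries.
        long ops = runLoad(100, 200, () -> {
            String[] listing = directory.clone();
            if (listing.length != directory.length) {
                throw new IllegalStateException("unexpected listing size");
            }
        });
        System.out.println("completed listings: " + ops);
    }
}
```

To measure a real NameNode, the Runnable would issue the actual listing call from an HDFS client, with a profiler attached on the server side as described in the comment.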
[jira] Commented: (HDFS-946) NameNode should not return full path name when listing a directory or getting the status of a file
[ https://issues.apache.org/jira/browse/HDFS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829564#action_12829564 ]

dhruba borthakur commented on HDFS-946:
---------------------------------------

Clients should continue to get the full path name in a FileStatus object, shouldn't they? Otherwise many, many existing client applications will break.

If you are proposing that the object sent over the wire be different from FileStatus, please consider the requirement of HDFS-878 too.
[jira] Commented: (HDFS-946) NameNode should not return full path name when listing a directory or getting the status of a file
[ https://issues.apache.org/jira/browse/HDFS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829250#action_12829250 ]

Doug Cutting commented on HDFS-946:
-----------------------------------

This sounds reasonable. But the client would still return fully-qualified paths, no?
[jira] Commented: (HDFS-946) NameNode should not return full path name when listing a directory or getting the status of a file
[ https://issues.apache.org/jira/browse/HDFS-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829269#action_12829269 ]

Hairong Kuang commented on HDFS-946:
------------------------------------

For this jira, the client will still return fully-qualified paths. But I am thinking that even at the FileContext level it is not necessary to return fully-qualified paths. However, this is a user-facing incompatible change, so I would prefer to discuss it in a different jira.
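One way to picture how a client could keep returning fully-qualified paths when the server sends back only local names is sketched below. It deliberately avoids Hadoop classes and uses java.net.URI as a stand-in for Path; the class QualifySketch, its qualify method, and the namenode.example.com address are hypothetical and are not taken from the patch.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical client-side helper: rebuilds a fully-qualified path from the
// file system URI, the directory that was listed, and a returned local name.
final class QualifySketch {

    static URI qualify(final URI fsUri, final String parentDir, final String localName) {
        final String path;
        if (localName.isEmpty()) {
            path = parentDir; // the status describes the requested path itself
        } else if (parentDir.endsWith("/")) {
            path = parentDir + localName;
        } else {
            path = parentDir + "/" + localName;
        }
        try {
            // Scheme and authority come from the file system; only the path is new.
            return new URI(fsUri.getScheme(), fsUri.getAuthority(), path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot qualify " + path, e);
        }
    }

    public static void main(String[] args) {
        URI fs = URI.create("hdfs://namenode.example.com:8020");
        System.out.println(qualify(fs, "/user/alice/output", "part-00000"));
    }
}
```

Because the client already knows which directory it asked about, the parent prefix never has to cross the wire, and callers still see the same fully-qualified form as before.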