[ https://issues.apache.org/jira/browse/HADOOP-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171831#comment-14171831 ]
Byron Wong commented on HADOOP-6857:
------------------------------------

When a directory /D and a snapshot S are in exactly the same state (e.g., a fresh snapshot has just been taken), everything works fine: the sum of the disk-consumed numbers reported by -du /D equals the disk-consumed number reported by -du -s /D.

Once /D and S start deviating (files getting renamed, deleted, etc.), the disk-consumed calculation takes the last file size recorded within the snapshots, finds the maximum replication factor for that file within the snapshots, multiplies the two together, and increments disk consumed by that product. This inflates the total disk-consumed calculation, so -du -s /D > the sum of the numbers in -du /D.

I'd also like to point out that this implementation only takes the replication factor of a file into account, even when that replication factor is greater than the number of datanodes, which further inflates the -du calculation. For example, if we setrep 10 a file when we only have 3 datanodes, -du will still multiply fileLength * 10 and report that number.

> FsShell should report raw disk usage including replication factor
> -----------------------------------------------------------------
>
>                 Key: HADOOP-6857
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6857
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Alex Kozlov
>            Assignee: Byron Wong
>         Attachments: HADOOP-6857.patch, show-space-consumed.txt
>
> Currently FsShell reports HDFS usage with the "hadoop fs -dus <path>" command. Since the replication level is set per file, it would be nice to add raw disk usage including the replication factor (maybe "hadoop fs -dus -raw <path>"?). This would allow assessing resource usage more accurately.

-- Alex K

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
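To make the inflation concrete, here is a minimal sketch of the two effects described above. All names (snapshotDiskConsumed, actualDiskConsumed, etc.) are illustrative, not the actual HDFS internals: the reported value multiplies the last file size by the maximum replication factor seen across snapshots, while the real on-disk usage can never exceed fileSize * numberOfDatanodes.

```java
public class DuInflationSketch {

    // What the snapshot-aware calculation effectively does: take the last
    // file size recorded in the snapshots and the *maximum* replication
    // factor for that file across snapshots, and multiply the two.
    static long snapshotDiskConsumed(long lastFileSize, int[] replicationInSnapshots) {
        int maxRep = 0;
        for (int r : replicationInSnapshots) {
            maxRep = Math.max(maxRep, r);
        }
        return lastFileSize * maxRep;
    }

    // The raw space a file can actually occupy is bounded by the cluster
    // size: a block cannot have more replicas than there are datanodes.
    static long actualDiskConsumed(long fileSize, int replication, int numDataNodes) {
        return fileSize * Math.min(replication, numDataNodes);
    }

    public static void main(String[] args) {
        long fileSize = 100L * 1024 * 1024;   // a 100 MB file
        int[] reps = {3, 10};                 // replication changed between snapshots

        // Reported: uses max replication (10), even though setrep 10 on a
        // 3-datanode cluster can place at most 3 replicas.
        long reported = snapshotDiskConsumed(fileSize, reps);
        long actual = actualDiskConsumed(fileSize, 10, 3);

        System.out.println("reported=" + reported + " actual=" + actual);
    }
}
```

Under these assumptions the reported number is fileSize * 10 while only fileSize * 3 bytes can physically exist, which matches the setrep example in the comment.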