[ https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524191#comment-15524191 ]
Jing Zhao commented on HDFS-10797:
----------------------------------

Thanks for working on this, Sean. This is actually a known issue. The reason we currently choose to count the blocks twice is to keep the semantics consistent with the scenario in which the rename op moves the file out of the original parent directory (e.g., into another snapshottable subtree or even a non-snapshottable subtree). In the latter case, the moved file has to be counted in both the source directory and the target directory, since it belongs both to snapshots of the source and to the current target directory.

The scenario described in this jira is a special case. I agree it's strange to find that a file is counted twice after being renamed under the same directory. However, with the current change the semantics will become inconsistent. For example, we will sometimes count the file twice and sometimes only once in the following scenario:

1. move the file out of the original parent directory
2. optional: take a new snapshot
3. move the file back

Without step 2 the file is counted only once, but with step 2 the file is counted twice, since the newly created item is in another snapshot diff. Or, if the file is renamed into a subdirectory of its parent directory (/foo/bar --> /foo/subdir/bar), the file is still double counted. From the end user's point of view this inconsistency is also strange.

So currently I'm leaning towards always doing the double count. Thoughts?
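To make the three-step scenario above concrete, here is a toy model in Java. It is only a sketch under simplifying assumptions and is not the HDFS implementation: the class SnapshotUsageModel, its methods, and the skipRecreatedInSameDiff flag are all hypothetical names. It assumes that each snapshot diff of a directory keeps a deleted list and a created list, that usage is the current files plus every entry in a deleted list, and that the proposed change amounts to skipping a deleted entry when the same inode appears in the created list of the same diff.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these names exist in HDFS.
final class SnapshotUsageModel {

    // One snapshot diff of a directory: inodes removed from or added to
    // the directory while this diff was the latest one.
    static final class Diff {
        final Set<Long> deleted = new HashSet<>();
        final Set<Long> created = new HashSet<>();
    }

    private final Set<Long> current = new HashSet<>();
    private final List<Diff> diffs = new ArrayList<>();

    // Adds a file before any snapshot exists, so no diff records it.
    void addFile(long inode) { current.add(inode); }

    void takeSnapshot() { diffs.add(new Diff()); }

    private Diff latest() { return diffs.get(diffs.size() - 1); }

    // Both moves assume at least one snapshot has already been taken.
    void moveOut(long inode) {
        current.remove(inode);
        latest().deleted.add(inode);
    }

    void moveBack(long inode) {
        current.add(inode);
        latest().created.add(inode);
    }

    // Counts current files plus files found in the deleted lists.
    // With the flag set, a deleted entry is skipped when the same inode
    // was re-created within the same diff (assumed patched behaviour).
    int countFiles(boolean skipRecreatedInSameDiff) {
        int count = current.size();
        for (Diff d : diffs) {
            for (long inode : d.deleted) {
                if (skipRecreatedInSameDiff && d.created.contains(inode)) {
                    continue;
                }
                count++;
            }
        }
        return count;
    }

    // Runs steps 1-3 from the comment for a single file and returns
    // how many times that file ends up being counted.
    static int scenario(boolean snapshotInBetween, boolean skipRecreatedInSameDiff) {
        SnapshotUsageModel dir = new SnapshotUsageModel();
        dir.addFile(1L);
        dir.takeSnapshot();          // the snapshot that still references the file
        dir.moveOut(1L);             // step 1
        if (snapshotInBetween) {
            dir.takeSnapshot();      // step 2 (optional)
        }
        dir.moveBack(1L);            // step 3
        return dir.countFiles(skipRecreatedInSameDiff);
    }

    public static void main(String[] args) {
        for (boolean skip : new boolean[] {false, true}) {
            for (boolean snap : new boolean[] {false, true}) {
                System.out.println("skip=" + skip + " snapshotInBetween=" + snap
                        + " count=" + scenario(snap, skip));
            }
        }
    }
}
```

In this model, always double counting gives 2 whether or not step 2 happens, while the skipping variant gives 1 without step 2 and 2 with it, which is the inconsistency described above.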
> Disk usage summary of snapshots causes renamed blocks to get counted twice
> --------------------------------------------------------------------------
>
>                 Key: HDFS-10797
>                 URL: https://issues.apache.org/jira/browse/HDFS-10797
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>         Attachments: HDFS-10797.001.patch, HDFS-10797.002.patch,
>                      HDFS-10797.003.patch
>
>
> DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how
> much disk usage is used by a snapshot by tallying up the files in the
> snapshot that have since been deleted (that way it won't overlap with regular
> files whose disk usage is computed separately). However that is determined
> from a diff that shows moved (to Trash or otherwise) or renamed files as a
> deletion and a creation operation that may overlap with the list of blocks.
> Only the deletion operation is taken into consideration, and this causes
> those blocks to get represented twice in the disk usage tallying.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)