[jira] [Comment Edited] (HDFS-10797) Disk usage summary of snapshots causes renamed blocks to get counted twice

2017-04-21 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979494#comment-15979494
 ] 

Junping Du edited comment on HDFS-10797 at 4/21/17 10:28 PM:
-

Hi [~mackrorysd] and [~xiaochen], can you take a look at HDFS-11661, which 
could be caused by the improvement here? Thx!


was (Author: djp):
Hi [~mackrorysd], can you take a look at HDFS-11661 which could be caused by 
improvement here? Thx!

> Disk usage summary of snapshots causes renamed blocks to get counted twice
> --
>
> Key: HDFS-10797
> URL: https://issues.apache.org/jira/browse/HDFS-10797
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: snapshots
> Affects Versions: 2.8.0
> Reporter: Sean Mackrory
> Assignee: Sean Mackrory
> Fix For: 2.8.0, 3.0.0-alpha2
>
> Attachments: HDFS-10797.001.patch, HDFS-10797.002.patch, 
> HDFS-10797.003.patch, HDFS-10797.004.patch, HDFS-10797.005.patch, 
> HDFS-10797.006.patch, HDFS-10797.007.patch, HDFS-10797.008.patch, 
> HDFS-10797.009.patch, HDFS-10797.010.patch, HDFS-10797.010.patch
>
>
> DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how 
> much disk space a snapshot uses by tallying up the files in the snapshot 
> that have since been deleted (that way it won't overlap with regular files, 
> whose disk usage is computed separately). However, that is determined from 
> a diff that shows moved (to Trash or otherwise) or renamed files as a 
> deletion and a creation operation that may overlap with the list of blocks. 
> Only the deletion operation is taken into consideration, and this causes 
> those blocks to get represented twice in the disk usage tallying.
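
The double counting described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the HDFS implementation: the `File` class, the `inode_id` field, and both functions are hypothetical names introduced for illustration, and the "diff" is reduced to a plain list of deleted entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class File:
    inode_id: int  # hypothetical identity that survives a rename
    path: str
    size: int      # bytes consumed on disk


def naive_usage(current_files, snapshot_deleted):
    """Sum current files plus every entry in the snapshot diff's deleted list."""
    return (sum(f.size for f in current_files)
            + sum(f.size for f in snapshot_deleted))


def deduplicated_usage(current_files, snapshot_deleted):
    """Count each inode once, however many lists it appears in."""
    sizes = {}
    for f in list(current_files) + list(snapshot_deleted):
        sizes[f.inode_id] = f.size
    return sum(sizes.values())


# /dir/a was snapshotted and then renamed to /dir/b.
before = File(inode_id=1, path="/dir/a", size=100)
after = File(inode_id=1, path="/dir/b", size=100)
untouched = File(inode_id=2, path="/dir/c", size=50)

current = [after, untouched]
deleted_in_diff = [before]  # the rename appears as a deletion of /dir/a

naive = naive_usage(current, deleted_in_diff)
deduped = deduplicated_usage(current, deleted_in_diff)
print(naive, deduped)
```

In this model the renamed file contributes its 100 bytes twice to the naive total, once as a current file and once as a "deleted" snapshot entry, while keying on the inode identity counts it once.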



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-10797) Disk usage summary of snapshots causes renamed blocks to get counted twice

2016-09-29 Thread Sean Mackrory (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533078#comment-15533078
 ] 

Sean Mackrory edited comment on HDFS-10797 at 9/29/16 3:27 PM:
---

Now attaching a patch that should give what I think we agree are the completely 
correct semantics:
- du -s in a given directory will yield the space consumed by all snapshots and 
current files in the relevant directory, or below, regardless of subsequent 
renames or deletions.
- du -s in a parent directory may yield a smaller value than the sum of all 
du -s results in child directories, if files have been snapshotted in one child 
directory and then moved to another. There is overlap in the space consumed by 
each directory in this case.
- In no case is any INode counted twice in the context of a single du -s 
computation.
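
As a rough illustration of the three points above, here is a toy model of the intended semantics. It is a sketch only and does not reflect the code in the patch: the `du` function, the path-to-inode dictionaries, and the example paths are all assumptions made for this example.

```python
def du(directory, current, snapshotted, sizes):
    """Space used by all current and snapshotted inodes below `directory`.

    `current` and `snapshotted` map a path to a (hypothetical) inode id;
    `sizes` maps an inode id to its size in bytes. Collecting ids in a set
    ensures that no inode is counted twice within a single computation.
    """
    prefix = directory.rstrip("/") + "/"
    seen = set()
    for path, inode in list(current.items()) + list(snapshotted.items()):
        if path.startswith(prefix):
            seen.add(inode)
    return sum(sizes[i] for i in seen)


sizes = {1: 100, 2: 30}

# Inode 1 was snapshotted as /p/x/f and later moved to /p/y/f.
snapshotted = {"/p/x/f": 1}
current = {"/p/y/f": 1, "/p/y/g": 2}

du_x = du("/p/x", current, snapshotted, sizes)  # snapshot copy only
du_y = du("/p/y", current, snapshotted, sizes)  # current files only
du_p = du("/p", current, snapshotted, sizes)    # union, inode 1 counted once
print(du_x, du_y, du_p)
```

Here each child directory reports the moved file, so the children sum to more than the parent, while the parent counts the shared inode a single time.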

In my opinion, these semantics are the most correct and least surprising to 
users, and they are consistent. If I understand your first reply correctly, I 
*think* you would agree with this, [~jingzhao]? And the implementation resolves 
the inconsistency we were trying to avoid. Let me know if I've misunderstood 
anything...

Attaching the patch, but going to do a little more manual testing, tinkering 
and thinking about this before "submitting" again since this is a very 
different approach from my previous patches.

edit:

{quote}To me a better semantic can be like this: if the renamed source (which 
is inside of some snapshot) and the renamed target are both under the same 
directory for counting, we count them once. Otherwise they will be counted 
separately.{quote}

So I *think* what you're describing is what I ended up with in the .004 patch. 
I do think that's a step in the right direction, but still a little surprising 
and non-intuitive for someone who hasn't read our definition of the semantics 
carefully. I think .005 yields the ideal semantics I was originally going for. 
Do you agree?


> Disk usage summary of snapshots causes renamed blocks to get counted twice
> --
>
> Key: HDFS-10797
> URL: https://issues.apache.org/jira/browse/HDFS-10797
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Sean Mackrory
> Assignee: Sean Mackrory
> Attachments: HDFS-10797.001.patch, HDFS-10797.002.patch, 
> HDFS-10797.003.patch, HDFS-10797.004.patch, HDFS-10797.005.patch


