[jira] [Commented] (HDFS-10797) Disk usage summary of snapshots causes renamed blocks to get counted twice

Sean Mackrory (JIRA) Wed, 05 Oct 2016 09:59:34 -0700

    [ https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15549321#comment-15549321 ]


Sean Mackrory commented on HDFS-10797:
--------------------------------------

Thanks, [~xiaochen]. Except as noted below, I'll incorporate all your feedback 
into another patch...

{quote}I don't think that will be a critical path to impact du 
performance{quote}

Yeah - I'm not sure whether anything performance-critical depends on du, but I 
would think correctness of the final result is far more important here anyway.

{quote}In nodeIncluded, we safeguard includedNodes in a synchronized block, but 
we also provide a getIncludedNodes method, which could potentially be updated 
by the caller. No real usage yet, but I just feel this a bit unsafe in general, 
maybe return a clone of it instead?{quote}

So my concern was not that the contents of the HashSet instance might change, 
but that the reference 'counts' temporarily points to a different object 
entirely while the deleted, snapshotted INodes are being tallied. Rather than 
protecting the data structure itself, the synchronized block ensures no one can 
call getCounts() while counts points to the wrong object. Beyond that, I think 
it's just as likely that threads calling getCounts in parallel will need their 
changes to propagate to the rest of the program, in which case the correct 
solution would be a thread-safe data structure rather than a clone. So I do 
think it's best to leave it as is until there is a use case for other 
concurrent accesses.
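
To illustrate the locking concern above, here is a minimal sketch. It is not the code from the patch: the class name SummaryContextSketch, the snapshotCounts field, and the add and tallySnapshot methods are all hypothetical, and only 'counts' and getCounts() are names taken from the discussion. The point it tries to show is that the lock guards the reference swap, not the contents of the map.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the summary context; NOT the actual HDFS class. */
public class SummaryContextSketch {
    // 'counts' normally points at the regular tally.
    private Map<String, Long> counts = new HashMap<>();
    // Separate tally for deleted, snapshotted items (assumed name).
    private final Map<String, Long> snapshotCounts = new HashMap<>();

    /** Synchronized so callers never observe the temporarily swapped reference. */
    public synchronized Map<String, Long> getCounts() {
        return counts;
    }

    public synchronized Map<String, Long> getSnapshotCounts() {
        return snapshotCounts;
    }

    /** Adds to whichever map 'counts' currently points at. */
    public synchronized void add(String key, long value) {
        counts.merge(key, value, Long::sum);
    }

    /** Runs the tally while 'counts' points at snapshotCounts, then restores it. */
    public synchronized void tallySnapshot(Runnable tally) {
        Map<String, Long> saved = counts;
        counts = snapshotCounts;
        try {
            tally.run(); // the monitor is reentrant, so add() works from this thread
        } finally {
            counts = saved;
        }
    }
}
```

Note that getCounts() still hands out the live map, so in this sketch the lock only prevents a caller from receiving the wrong object during the swap. Concurrent writers to the returned map would need a thread-safe structure, as argued above.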

> Disk usage summary of snapshots causes renamed blocks to get counted twice
> --------------------------------------------------------------------------
>
>                 Key: HDFS-10797
>                 URL: https://issues.apache.org/jira/browse/HDFS-10797
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>         Attachments: HDFS-10797.001.patch, HDFS-10797.002.patch, 
> HDFS-10797.003.patch, HDFS-10797.004.patch, HDFS-10797.005.patch, 
> HDFS-10797.006.patch, HDFS-10797.007.patch, HDFS-10797.008.patch, 
> HDFS-10797.009.patch
>
>
> DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how 
> much disk space a snapshot uses by tallying up the files in the snapshot 
> that have since been deleted (that way it won't overlap with regular files, 
> whose disk usage is computed separately). However, that is determined from a 
> diff that shows moved (to Trash or otherwise) or renamed files as a deletion 
> and a creation operation, which may overlap with the list of blocks. Only 
> the deletion operation is taken into consideration, and this causes those 
> blocks to get represented twice in the disk usage tallying.
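
The double counting described above can be sketched as follows, assuming a guard that remembers which inodes have already been tallied. The names nodeIncluded and includedNodes come from the review comment quoted earlier; the class DuDedupSketch, the use of numeric inode ids, and the tally and getTotal methods are hypothetical and only illustrate the idea, not the patch itself.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of de-duplicating a disk usage tally. */
public class DuDedupSketch {
    // Ids of inodes already counted (numeric ids are an assumption here).
    private final Set<Long> includedNodes = new HashSet<>();
    private long total = 0;

    /** Returns true if the inode was already tallied; otherwise records it. */
    public synchronized boolean nodeIncluded(long inodeId) {
        return !includedNodes.add(inodeId);
    }

    /** Counts the bytes of an inode at most once, however often it is seen. */
    public synchronized void tally(long inodeId, long bytes) {
        if (!nodeIncluded(inodeId)) {
            total += bytes;
        }
    }

    public synchronized long getTotal() {
        return total;
    }
}
```

With such a guard, a renamed file that shows up once as a regular file and once as a deletion in the snapshot diff contributes its blocks only once.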



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
