[ https://issues.apache.org/jira/browse/HDFS-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933247#action_12933247 ]

Eli Collins commented on HDFS-1377:
-----------------------------------

bq. So what is the original idea of the code to be removed in your patch?

If the oldnode and newnode arguments to replaceNode always have the same blocks 
then the disk space updating code in replaceNode is redundant. This code goes 
back to the original space quota addition (HADOOP-3938), and it looks like it 
was redundant there as well (note that the comment near the code even says 
"Currently oldnode and newnode are assumed to contain the same blocks", so 
perhaps they were trying to be conservative).

So the question becomes: do oldnode and newnode ever have different blocks? On 
branch 20 I don't think that's the case. The only caller where it seems they 
could potentially differ is loadFilesUnderConstruction, where the under 
construction INode may have an extra block, but wouldn't that have already 
been accounted for via addStoredBlock?

I still need to look at the trunk code more thoroughly; the latest code, 
particularly append, makes things less obvious. I do think we should remove 
this particular file-level accounting code in replaceNode and always make 
quota adjustments at block granularity.
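
To make that concrete, here is a rough sketch of what I have in mind for 
branch 20 (based on the replaceNode snippet quoted in the description below; 
the signature and surrounding details are illustrative, not an actual patch):

{code}
// Sketch only: replaceNode with the file-level space accounting removed.
// Since oldnode and newnode are assumed to contain the same blocks, the
// per-block adjustment already made in FSNamesystem#addStoredBlock covers
// the difference between the preferred block size and the bytes actually
// written, so no updateSpaceConsumed call is needed here.
void replaceNode(String path, INodeFile oldnode, INodeFile newnode)
    throws IOException {
  synchronized (rootDir) {
    // remove oldnode and insert newnode in its place as before; the
    // dsOld/dsNew comparison and the updateSpaceConsumed(path, 0, dsNew-dsOld)
    // call in the current code are simply dropped
  }
}
{code}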

> Quota bug for partial blocks allows quotas to be violated 
> ----------------------------------------------------------
>
>                 Key: HDFS-1377
>                 URL: https://issues.apache.org/jira/browse/HDFS-1377
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>            Priority: Blocker
>             Fix For: 0.20.3, 0.21.1, 0.22.0
>
>         Attachments: HDFS-1377.patch
>
>
> There's a bug in the quota code that causes quotas not to be respected when 
> a file is not an exact multiple of the block size. Here's an example:
> {code}
> $ hadoop fs -mkdir /test
> $ hadoop dfsadmin -setSpaceQuota 384M /test
> $ ls dir/ | wc -l   # dir contains 101 files
> 101
> $ du -ms dir        # each is 3mb
> 304   dir
> $ hadoop fs -put dir /test
> $ hadoop fs -count -q /test
>         none             inf       402653184      -550502400            2        101          317718528 hdfs://haus01.sf.cloudera.com:10020/test
> $ hadoop fs -stat "%o %r" /test/dir/f30
> 134217728 3    # three 128mb blocks
> {code}
> INodeDirectoryWithQuota caches the number of bytes consumed by its children 
> in {{diskspace}}. The quota adjustment code has a bug that causes 
> {{diskspace}} to get updated incorrectly when a file is not an exact multiple 
> of the block size (the value ends up being negative). 
> This causes the quota checking code to think that the files in the directory 
> consume less space than they actually do, so verifyQuota does not throw a 
> QuotaExceededException even when the directory is over quota. However, the 
> bug isn't visible to users because {{fs count -q}} reports the numbers 
> generated by INode#getContentSummary, which adds up the sizes of the blocks 
> rather than using the cached INodeDirectoryWithQuota#diskspace value.
> In FSDirectory#addBlock the disk space consumed is set conservatively to the 
> full block size * the number of replicas:
> {code}
> updateCount(inodes, inodes.length-1, 0,
>     fileNode.getPreferredBlockSize()*fileNode.getReplication(), true);
> {code}
> In FSNameSystem#addStoredBlock we adjust for this conservative estimate by 
> subtracting out the difference between the conservative estimate and the 
> number of bytes actually stored:
> {code}
> //Updated space consumed if required.
> INodeFile file = (storedBlock != null) ? storedBlock.getINode() : null;
> long diff = (file == null) ? 0 :
>     (file.getPreferredBlockSize() - storedBlock.getNumBytes());
> if (diff > 0 && file.isUnderConstruction() &&
>     cursize < storedBlock.getNumBytes()) {
> ...
>     dir.updateSpaceConsumed(path, 0, -diff*file.getReplication());
> {code}
> We do the same in FSDirectory#replaceNode when completing the file, but at 
> file granularity (I believe the intent here is to correct for cases where 
> there's a failure replicating blocks and recovery). Since oldnode is under 
> construction, INodeFile#diskspaceConsumed will use the preferred block size 
> (vs. Block#getNumBytes used by newnode), so we will again subtract out the 
> difference between the full block size and the number of bytes actually 
> stored:
> {code}
> long dsOld = oldnode.diskspaceConsumed();
> ...
> //check if disk space needs to be updated.
> long dsNew = 0;
> if (updateDiskspace && (dsNew = newnode.diskspaceConsumed()) != dsOld) {
>   try {
>     updateSpaceConsumed(path, 0, dsNew-dsOld);
> ...
> {code}
> So in the above example we start with diskspace at 384mb (3 * 128mb) for 
> each file and then subtract 375mb (to reflect that only 9mb raw was actually 
> used) twice, so each file contributes -366mb (384mb minus 2 * 375mb) to the 
> directory's diskspace. This is why the quota goes negative and yet we can 
> still write more files.
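> For reference, here is the per-file arithmetic above restated as a small 
> standalone snippet (plain numbers, no HDFS code involved):
> {code}
> public class QuotaMath {
>   public static void main(String[] args) {
>     long preferred = 128L * 1024 * 1024;          // preferred block size
>     long actual    = 3L * 1024 * 1024;            // bytes actually written
>     int  repl      = 3;                           // replication factor
>     long charged   = preferred * repl;            // 402653184 (384mb) charged by addBlock
>     long diff      = (preferred - actual) * repl; // 393216000 (375mb) over-estimate
>     System.out.println(charged - diff);           // 9437184 (9mb): correct, after addStoredBlock
>     System.out.println(charged - 2 * diff);       // -383778816 (-366mb): after replaceNode subtracts again
>   }
> }
> {code}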
> So a directory with lots of single block files (and for files with multiple 
> blocks, only the final partial block ends up subtracting from the diskspace 
> used) ends up having a quota that's way off.
> I think the fix is, in FSDirectory#replaceNode, to not have the 
> diskspaceConsumed calculations differ when the old and new INode have the 
> same blocks. I'll work on a patch that also adds a quota test for blocks 
> that are not multiples of the block size and warns in 
> INodeDirectory#computeContentSummary if the computed size does not reflect 
> the cached value.
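> A rough sketch of what that warning might look like (the field and method 
> names here are assumptions for illustration, not the actual patch):
> {code}
> // Illustrative only: at the end of INodeDirectory#computeContentSummary,
> // compare the freshly computed disk space against the cached quota counter
> // and warn if they disagree. "summary[3]" holding the computed disk space
> // and "diskspace" holding the cached value are assumptions about the
> // surrounding code.
> long computed = summary[3];
> long cached = diskspace;
> if (cached != computed) {
>   NameNode.LOG.warn("Inconsistent diskspace for directory " + getLocalName()
>       + ": cached = " + cached + " != computed = " + computed);
> }
> {code}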

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
