[ https://issues.apache.org/jira/browse/HDFS-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eli Collins updated HDFS-1377:
------------------------------

    Fix Version/s: 0.20.204.0

It is already checked into branch-0.20-security and branch-0.20-security-204. Updating the fix version to match. It would be great if this were done for the other jiras in the branch.

> Quota bug for partial blocks allows quotas to be violated
> ----------------------------------------------------------
>
>                 Key: HDFS-1377
>                 URL: https://issues.apache.org/jira/browse/HDFS-1377
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0, 0.23.0
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>            Priority: Blocker
>             Fix For: 0.20.3, 0.20.204.0, 0.21.1, Federation Branch, 0.22.0, 0.23.0
>
>         Attachments: HDFS-1377.patch, hdfs-1377-1.patch, hdfs-1377-b20-1.patch, hdfs-1377-b20-2.patch, hdfs-1377-b20-3.patch
>
>
> There's a bug in the quota code that causes quotas not to be respected when a file is not an exact multiple of the block size. Here's an example:
> {code}
> $ hadoop fs -mkdir /test
> $ hadoop dfsadmin -setSpaceQuota 384M /test
> $ ls dir/ | wc -l        # dir contains 101 files
> 101
> $ du -ms dir             # each is 3mb
> 304     dir
> $ hadoop fs -put dir /test
> $ hadoop fs -count -q /test
>         none     inf     402653184     -550502400     2     101     317718528 hdfs://haus01.sf.cloudera.com:10020/test
> $ hadoop fs -stat "%o %r" /test/dir/f30
> 134217728 3              # three 128mb blocks
> {code}
> INodeDirectoryWithQuota caches the number of bytes consumed by its children in {{diskspace}}. The quota adjustment code has a bug that causes {{diskspace}} to get updated incorrectly when a file is not an exact multiple of the block size (the value ends up negative).
> This causes the quota checking code to think that the files in the directory consume less space than they actually do, so verifyQuota does not throw a QuotaExceededException even when the directory is over quota. However the bug isn't visible to users because {{fs count -q}} reports the numbers generated by INode#getContentSummary, which adds up the sizes of the blocks rather than using the cached INodeDirectoryWithQuota#diskspace value.
> In FSDirectory#addBlock the disk space consumed is set conservatively to the full block size * the number of replicas:
> {code}
> updateCount(inodes, inodes.length-1, 0,
>     fileNode.getPreferredBlockSize()*fileNode.getReplication(), true);
> {code}
> In FSNameSystem#addStoredBlock we adjust for this conservative estimate by subtracting out the difference between the conservative estimate and the number of bytes actually stored:
> {code}
> //Updated space consumed if required.
> INodeFile file = (storedBlock != null) ? storedBlock.getINode() : null;
> long diff = (file == null) ? 0 :
>     (file.getPreferredBlockSize() - storedBlock.getNumBytes());
>
> if (diff > 0 && file.isUnderConstruction() &&
>     cursize < storedBlock.getNumBytes()) {
>   ...
>   dir.updateSpaceConsumed(path, 0, -diff*file.getReplication());
> {code}
> We do the same in FSDirectory#replaceNode when completing the file, but at file granularity (I believe the intent here is to correct for cases where there is a failure replicating blocks followed by recovery). Since oldnode is under construction, INodeFile#diskspaceConsumed will use the preferred block size (vs. Block#getNumBytes, which is used by newnode), so we will again subtract out the difference between the full block size and the number of bytes actually stored:
> {code}
> long dsOld = oldnode.diskspaceConsumed();
> ...
> //check if disk space needs to be updated.
> long dsNew = 0;
> if (updateDiskspace && (dsNew = newnode.diskspaceConsumed()) != dsOld) {
>   try {
>     updateSpaceConsumed(path, 0, dsNew-dsOld);
>   ...
> {code}
> So in the above example we started with diskspace at 384mb (3 * 128mb) and then we subtract 375mb (to reflect that only 9mb raw was actually used) twice, so each file ends up contributing -366mb (384mb minus 2 * 375mb) to the directory's diskspace. This is why the cached value goes negative and yet we can still write more files.
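> To make the double subtraction concrete, here is a small standalone calculation (just a sketch in plain Java, not HDFS code; the constants mirror the example above: 128mb preferred block size, 3mb files, replication 3):
> {code}
> public class QuotaDoubleSubtraction {
>   public static void main(String[] args) {
>     long preferredBlockSize = 128L * 1024 * 1024; // 128mb preferred block size
>     long actualBytes = 3L * 1024 * 1024;          // each file stores only 3mb
>     short replication = 3;
>
>     // FSDirectory#addBlock charges the full block size per replica up front.
>     long diskspace = preferredBlockSize * replication;            // 384mb
>
>     // FSNameSystem#addStoredBlock subtracts the unused tail of the block ...
>     long diff = (preferredBlockSize - actualBytes) * replication; // 375mb
>     diskspace -= diff;                                            // 9mb, correct
>
>     // ... and FSDirectory#replaceNode subtracts the same tail a second time.
>     diskspace -= diff;                                            // -366mb
>
>     System.out.println((diskspace / (1024 * 1024)) + "mb");       // prints -366mb
>   }
> }
> {code}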
> So a directory with lots of single-block files (for files with multiple blocks, only the final partial block ends up subtracting from the diskspace used) ends up having a quota that's way off.
> I think the fix is to make the diskspaceConsumed calculations in FSDirectory#replaceNode not differ when the old and new INode have the same blocks. I'll work on a patch which also adds a quota test for files that are not an exact multiple of the block size and warns in INodeDirectory#computeContentSummary if the computed size does not match the cached value.
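> To illustrate that direction, here is a rough sketch (illustrative only, not the actual patch; {{diskspaceConsumed(Block[])}} is assumed here as a helper that sums the actual sizes of the given blocks and may not exist with this exact signature). The idea is to compute dsOld over the same block list that newnode carries, so identical blocks yield a zero delta:
> {code}
> // Sketch only: derive dsOld from the same blocks as dsNew so that when the
> // old and new INode share blocks there is no second, bogus subtraction.
> long dsOld = oldnode.diskspaceConsumed(newnode.getBlocks());
> long dsNew = newnode.diskspaceConsumed();
> if (updateDiskspace && dsNew != dsOld) {
>   updateSpaceConsumed(path, 0, dsNew - dsOld);
> }
> {code}
> With both sides computed from the same blocks, the adjustment in replaceNode becomes a no-op in the common case, leaving the correction in FSNameSystem#addStoredBlock as the place where the conservative estimate is trimmed.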