[ https://issues.apache.org/jira/browse/HDFS-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ZanderXu updated HDFS-17497:
----------------------------
    Description: 
An in-writing HDFS file may contain multiple committed blocks, for example (assume the file contains three blocks):

|| ||Block 1||Block 2||Block 3||
|Case 1|Complete|Committed|UnderConstruction|
|Case 2|Complete|Committed|Committed|
|Case 3|Committed|Committed|Committed|

But the handling of committed blocks is inconsistent when computing the file size: the bytes of the last committed block are ignored, while the bytes of every other committed block are counted.

{code:java}
public final long computeFileSize(boolean includesLastUcBlock,
    boolean usePreferredBlockSize4LastUcBlock) {
  if (blocks.length == 0) {
    return 0;
  }
  final int last = blocks.length - 1;
  // Check whether the last block is a BlockInfoUnderConstruction.
  BlockInfo lastBlk = blocks[last];
  long size = lastBlk.getNumBytes();
  // The last committed block is not complete, so its bytes may be ignored.
  if (!lastBlk.isComplete()) {
    if (!includesLastUcBlock) {
      size = 0;
    } else if (usePreferredBlockSize4LastUcBlock) {
      size = isStriped() ?
          getPreferredBlockSize() *
              ((BlockInfoStriped) lastBlk).getDataBlockNum() :
          getPreferredBlockSize();
    }
  }
  // The bytes of the other committed blocks are counted in the file length.
  for (int i = 0; i < last; i++) {
    size += blocks[i].getNumBytes();
  }
  return size;
}
{code}

The length of a committed block can no longer change, so the bytes of the last committed block should be counted in the file length as well.
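For illustration only, here is a minimal, self-contained sketch of that proposed rule (this is not the actual {{INodeFile}} code; {{BlockState}}, {{SimpleBlock}} and the {{main}} driver are made-up stand-ins): every COMPLETE or COMMITTED block contributes its actual length to the file size, and only a last block that is still UNDER_CONSTRUCTION may be skipped or estimated with the preferred block size.

{code:java}
// Made-up stand-ins for BlockInfo and BlockUCState, for illustration only.
enum BlockState { COMPLETE, COMMITTED, UNDER_CONSTRUCTION }

public class FileSizeSketch {
  static final class SimpleBlock {
    final BlockState state;
    final long numBytes;
    SimpleBlock(BlockState state, long numBytes) {
      this.state = state;
      this.numBytes = numBytes;
    }
  }

  /**
   * Proposed rule: committed bytes are final, so every COMPLETE or COMMITTED
   * block contributes its actual length; only a last block that is still
   * UNDER_CONSTRUCTION is skipped or estimated with the preferred block size.
   */
  static long computeFileSize(SimpleBlock[] blocks, boolean includesLastUcBlock,
      boolean usePreferredBlockSizeForLastUcBlock, long preferredBlockSize) {
    if (blocks.length == 0) {
      return 0;
    }
    final int last = blocks.length - 1;
    long size = 0;
    for (int i = 0; i < last; i++) {
      size += blocks[i].numBytes;
    }
    SimpleBlock lastBlk = blocks[last];
    if (lastBlk.state != BlockState.UNDER_CONSTRUCTION) {
      // COMPLETE or COMMITTED: the length can no longer change, always count it.
      size += lastBlk.numBytes;
    } else if (includesLastUcBlock) {
      // Still being written: keep today's behavior of estimating it.
      size += usePreferredBlockSizeForLastUcBlock
          ? preferredBlockSize : lastBlk.numBytes;
    }
    return size;
  }

  public static void main(String[] args) {
    long mb = 1L << 20;
    SimpleBlock[] case2 = {
        new SimpleBlock(BlockState.COMPLETE, 128 * mb),
        new SimpleBlock(BlockState.COMMITTED, 128 * mb),
        new SimpleBlock(BlockState.COMMITTED, 64 * mb)
    };
    // The last committed block's 64 MB is now counted: prints 335544320.
    System.out.println(computeFileSize(case2, false, false, 128 * mb));
  }
}
{code}

Under this rule, Cases 1-3 above all account for committed blocks the same way, and only the truly under-construction last block in Case 1 remains an estimate.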
The handling of committed blocks is also inconsistent when computing the file length in DFSInputStream. Normally DFSInputStream does not need to fetch the visible length from the DataNodes for a committed block, regardless of whether that committed block is the last block of the file.

HDFS-10843 encountered a bug that was actually caused by a committed block, but it fixed that bug by updating the quota usage when the block is completed. Since the length of a committed block can no longer change, we should update the quota usage as soon as the block is committed, which releases the over-reserved quota sooner (see the sketch after the list below).

So there are a few things we need to do:
* Unify the calculation logic for all committed blocks in {{computeFileSize}} of {{INodeFile}}
* Unify the calculation logic for all committed blocks in {{getFileLength}} of {{DFSInputStream}}
* Update the quota usage when the block is committed
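For context, a minimal, self-contained illustration of the quota point (this is not actual FSNamesystem/FSDirectory code; {{QuotaTracker}} and the numbers are made-up stand-ins): once a block is committed its length is final, so the space over-reserved for it ({{preferredBlockSize - actualBytes}} per replica) can be returned immediately instead of only when the block becomes complete.

{code:java}
// Made-up stand-in for per-directory space-quota accounting; the real logic
// lives elsewhere in the NameNode and is only approximated here.
public class CommitQuotaSketch {
  static final class QuotaTracker {
    private long consumedBytes;
    void add(long delta) { consumedBytes += delta; }
    long getConsumed() { return consumedBytes; }
  }

  public static void main(String[] args) {
    final long preferredBlockSize = 128L << 20; // 128 MB reserved per replica
    final short replication = 3;
    QuotaTracker quota = new QuotaTracker();

    // While the last block is under construction, the full preferred block
    // size is reserved for every replica.
    quota.add(preferredBlockSize * replication);
    System.out.println("reserved while writing: " + quota.getConsumed());

    // The client finishes the block; it becomes COMMITTED with its final
    // length, e.g. only 1 MB was actually written.
    final long committedBytes = 1L << 20;

    // Proposed: release the over-reservation at commit time, instead of
    // waiting until enough DataNodes have reported the block and it turns
    // COMPLETE (the point where the quota is adjusted after HDFS-10843).
    quota.add((committedBytes - preferredBlockSize) * replication);
    System.out.println("consumed after commit: " + quota.getConsumed());
  }
}
{code}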
> Logic for committed blocks is mixed when computing file size
> -------------------------------------------------------------
>
>                 Key: HDFS-17497
>                 URL: https://issues.apache.org/jira/browse/HDFS-17497
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: ZanderXu
>            Priority: Major
>              Labels: pull-request-available