[ https://issues.apache.org/jira/browse/HDFS-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ZanderXu updated HDFS-17497:
----------------------------
    Description: 
An in-writing HDFS file may contain multiple committed blocks, for example (assume the file contains three blocks):

|| ||Block 1||Block 2||Block 3||
|Case 1|Complete|Committed|UnderConstruction|
|Case 2|Complete|Committed|Committed|
|Case 3|Committed|Committed|Committed|

But the handling of committed blocks is inconsistent when computing the file size: the bytes of the last committed block are ignored, while the bytes of every other committed block are counted.

{code:java}
public final long computeFileSize(boolean includesLastUcBlock,
    boolean usePreferredBlockSize4LastUcBlock) {
  if (blocks.length == 0) {
    return 0;
  }
  final int last = blocks.length - 1;
  // Check whether the last block is a BlockInfoUnderConstruction.
  BlockInfo lastBlk = blocks[last];
  long size = lastBlk.getNumBytes();
  // The last committed block is not complete, so its bytes may be ignored.
  if (!lastBlk.isComplete()) {
    if (!includesLastUcBlock) {
      size = 0;
    } else if (usePreferredBlockSize4LastUcBlock) {
      size = isStriped() ?
          getPreferredBlockSize() *
              ((BlockInfoStriped) lastBlk).getDataBlockNum() :
          getPreferredBlockSize();
    }
  }
  // The bytes of the other committed blocks are counted in the file length.
  for (int i = 0; i < last; i++) {
    size += blocks[i].getNumBytes();
  }
  return size;
}
{code}

The length of a committed block can no longer change, so the bytes of the last committed block should be counted in the file length as well.
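For illustration only, here is a minimal, self-contained sketch of that proposed rule (this is not the actual {{INodeFile}} code; {{BlockState}}, {{SimpleBlock}} and the {{main}} driver are made-up stand-ins): every COMPLETE or COMMITTED block contributes its actual length to the file size, and only a last block that is still UNDER_CONSTRUCTION may be skipped or estimated with the preferred block size.

{code:java}
// Made-up stand-ins for BlockInfo and BlockUCState, for illustration only.
enum BlockState { COMPLETE, COMMITTED, UNDER_CONSTRUCTION }

public class FileSizeSketch {
  static final class SimpleBlock {
    final BlockState state;
    final long numBytes;
    SimpleBlock(BlockState state, long numBytes) {
      this.state = state;
      this.numBytes = numBytes;
    }
  }

  /**
   * Proposed rule: committed bytes are final, so every COMPLETE or COMMITTED
   * block contributes its actual length; only a last block that is still
   * UNDER_CONSTRUCTION is skipped or estimated with the preferred block size.
   */
  static long computeFileSize(SimpleBlock[] blocks, boolean includesLastUcBlock,
      boolean usePreferredBlockSizeForLastUcBlock, long preferredBlockSize) {
    if (blocks.length == 0) {
      return 0;
    }
    final int last = blocks.length - 1;
    long size = 0;
    for (int i = 0; i < last; i++) {
      size += blocks[i].numBytes;
    }
    SimpleBlock lastBlk = blocks[last];
    if (lastBlk.state != BlockState.UNDER_CONSTRUCTION) {
      // COMPLETE or COMMITTED: the length can no longer change, always count it.
      size += lastBlk.numBytes;
    } else if (includesLastUcBlock) {
      // Still being written: keep today's behavior of estimating it.
      size += usePreferredBlockSizeForLastUcBlock
          ? preferredBlockSize : lastBlk.numBytes;
    }
    return size;
  }

  public static void main(String[] args) {
    long mb = 1L << 20;
    SimpleBlock[] case2 = {
        new SimpleBlock(BlockState.COMPLETE, 128 * mb),
        new SimpleBlock(BlockState.COMMITTED, 128 * mb),
        new SimpleBlock(BlockState.COMMITTED, 64 * mb)
    };
    // The last committed block's 64 MB is now counted: prints 335544320.
    System.out.println(computeFileSize(case2, false, false, 128 * mb));
  }
}
{code}

Under this rule, Cases 1-3 above all account for committed blocks the same way, and only the truly under-construction last block in Case 1 remains an estimate.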
The handling of committed blocks is also inconsistent when computing the file length in DFSInputStream. Normally DFSInputStream does not need to fetch the visible length from the DataNodes for a committed block, regardless of whether that committed block is the last block of the file.

HDFS-10843 encountered a bug that was actually caused by a committed block, but it fixed that bug by updating the quota usage when the block is completed. Since the length of a committed block can no longer change, we should update the quota usage as soon as the block is committed, which releases the over-reserved quota sooner (see the sketch after the list below).

So there are a few things we need to do:
* Unify the calculation logic for all committed blocks in {{computeFileSize}} of {{INodeFile}}
* Unify the calculation logic for all committed blocks in {{getFileLength}} of {{DFSInputStream}}
* Update the quota usage when the block is committed
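For context, a minimal, self-contained illustration of the quota point (this is not actual FSNamesystem/FSDirectory code; {{QuotaTracker}} and the numbers are made-up stand-ins): once a block is committed its length is final, so the space over-reserved for it ({{preferredBlockSize - actualBytes}} per replica) can be returned immediately instead of only when the block becomes complete.

{code:java}
// Made-up stand-in for per-directory space-quota accounting; the real logic
// lives elsewhere in the NameNode and is only approximated here.
public class CommitQuotaSketch {
  static final class QuotaTracker {
    private long consumedBytes;
    void add(long delta) { consumedBytes += delta; }
    long getConsumed() { return consumedBytes; }
  }

  public static void main(String[] args) {
    final long preferredBlockSize = 128L << 20; // 128 MB reserved per replica
    final short replication = 3;
    QuotaTracker quota = new QuotaTracker();

    // While the last block is under construction, the full preferred block
    // size is reserved for every replica.
    quota.add(preferredBlockSize * replication);
    System.out.println("reserved while writing: " + quota.getConsumed());

    // The client finishes the block; it becomes COMMITTED with its final
    // length, e.g. only 1 MB was actually written.
    final long committedBytes = 1L << 20;

    // Proposed: release the over-reservation at commit time, instead of
    // waiting until enough DataNodes have reported the block and it turns
    // COMPLETE (the point where the quota is adjusted after HDFS-10843).
    quota.add((committedBytes - preferredBlockSize) * replication);
    System.out.println("consumed after commit: " + quota.getConsumed());
  }
}
{code}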
> Logic for committed blocks is mixed when computing file size
> -------------------------------------------------------------
>
>                 Key: HDFS-17497
>                 URL: https://issues.apache.org/jira/browse/HDFS-17497
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: ZanderXu
>            Priority: Major
>              Labels: pull-request-available