[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

ASF GitHub Bot (Jira) Sun, 21 Jan 2024 18:35:29 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809215#comment-17809215
 ]


ASF GitHub Bot commented on HDFS-17342:
---------------------------------------

ZanderXu commented on PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#issuecomment-1902959898

   > This is a bug fix after https://github.com/apache/hadoop/pull/5564 , do 
you have time to help review this?
   
   @smarthanwang I have a question about HDFS-16985. Normally, a 
FileNotFoundException means that the meta file or data file may be lost, so the 
replica on this DataNode may be corrupt, right? In your business situation (AWS 
EC2 + EBS), you don't expect the DataNode to delete this replica directly, so 
HDFS-16985 just removes the replica from the DN's memory.
   
   But I would like the DN to delete this corrupt replica directly if it can 
be sure that the replica is corrupt, for example when the meta file or data 
file is lost. 
   So we can add a configuration option to control whether the DN deletes this 
replica from disk directly, such as 
`fs.datanode.delete.corrupt.replica.from.disk`, with a default value of true.
   
   If `fs.datanode.delete.corrupt.replica.from.disk` is true, the DN can delete 
this corrupt replica from disk directly. If 
`fs.datanode.delete.corrupt.replica.from.disk` is false, the DN only removes 
this corrupt replica from memory.
   
   @smarthanwang @zhangshuyan0 looking forward to your good ideas.
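
   A minimal sketch of how such a switch could behave, assuming a plain 
map-based replica table in place of the real DataNode structures. The key 
`fs.datanode.delete.corrupt.replica.from.disk` is only the name proposed in 
this comment, not an existing Hadoop setting, and `CorruptReplicaPolicy` and 
its methods are hypothetical names used for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CorruptReplicaPolicy {
    // Key proposed in the comment; it is an assumption, not a real Hadoop key.
    static final String DELETE_FROM_DISK_KEY =
        "fs.datanode.delete.corrupt.replica.from.disk";
    static final boolean DELETE_FROM_DISK_DEFAULT = true;

    // Hypothetical stand-in for the DataNode's in-memory replica map:
    // block ID -> {data file, meta file}.
    private final Map<Long, Path[]> replicaMap = new ConcurrentHashMap<>();
    private final boolean deleteFromDisk;

    CorruptReplicaPolicy(Map<String, String> conf) {
        String value = conf.get(DELETE_FROM_DISK_KEY);
        this.deleteFromDisk = (value == null)
            ? DELETE_FROM_DISK_DEFAULT
            : Boolean.parseBoolean(value);
    }

    void addReplica(long blockId, Path dataFile, Path metaFile) {
        replicaMap.put(blockId, new Path[] {dataFile, metaFile});
    }

    boolean hasReplica(long blockId) {
        return replicaMap.containsKey(blockId);
    }

    /**
     * Drops a replica that is already known to be corrupt. The in-memory
     * entry is always removed; the files are deleted only if configured.
     */
    void removeCorruptReplica(long blockId) throws IOException {
        Path[] files = replicaMap.remove(blockId);
        if (files != null && deleteFromDisk) {
            for (Path file : files) {
                Files.deleteIfExists(file);
            }
        }
    }
}
```

   With the default, both the in-memory entry and any remaining files are 
removed; with the key set to false, the files are left on disk.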
   




> Fix DataNode may invalidates normal block causing missing block
> ---------------------------------------------------------------
>
>                 Key: HDFS-17342
>                 URL: https://issues.apache.org/jira/browse/HDFS-17342
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>
> When users read a file that is being appended to, exceptions may occasionally 
> occur, such as 
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: xxx.
> This can happen if one thread is reading the block while a writer thread is 
> finalizing it at the same time.
> *Root cause:*
> # The reader thread obtains an RBW replica from the VolumeMap, such as 
> blk_xxx_xxx[RBW], whose data file should be in /XXX/rbw/blk_xxx.
> # Simultaneously, the writer thread finalizes this block, moving it from 
> the RBW directory to the FINALIZE directory. The data file is moved from 
> /XXX/rbw/blk_xxx to /XXX/finalize/blk_xxx.
> # The reader thread attempts to open the data input stream but encounters a 
> FileNotFoundException because the data file /XXX/rbw/blk_xxx or meta file 
> /XXX/rbw/blk_xxx_xxx no longer exists at this moment.
> # The reader thread treats this block as corrupt and removes the replica 
> from the volume map, and the DataNode reports the deleted block to the 
> NameNode.
> # The NameNode removes this replica for the block.
> # If the file's replication factor is 1, the file will have a missing block 
> until this DataNode runs the DirectoryScanner again.
> As described above, the FileNotFoundException encountered by the reader 
> thread is expected, because the file has been moved.
> So we need to add a double check to the invalidateMissingBlock logic to 
> verify whether the data file or meta file exists, to avoid such cases.
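
The double check described in the quoted issue could look roughly like the
sketch below. It does not reproduce the actual patch: `ReplicaTracker`, its
`Replica` record, and the method signatures are hypothetical, and keeping the
replica only when both the data file and the meta file exist is an assumption
of this sketch. The idea is that, before removing a replica after a
FileNotFoundException, the current entry in the volume map is consulted again,
since the writer may have moved the files and updated the entry in the
meantime.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class ReplicaTracker {
    /** Hypothetical replica record: where the files currently live. */
    static final class Replica {
        final Path dataFile;
        final Path metaFile;

        Replica(Path dataFile, Path metaFile) {
            this.dataFile = dataFile;
            this.metaFile = metaFile;
        }
    }

    // Simplified stand-in for the volume map: block ID -> current replica.
    private final Map<Long, Replica> volumeMap = new HashMap<>();

    /** Registers or updates a replica, e.g. after finalization moved it. */
    synchronized void put(long blockId, Replica replica) {
        volumeMap.put(blockId, replica);
    }

    synchronized boolean contains(long blockId) {
        return volumeMap.containsKey(blockId);
    }

    /**
     * Called after a reader hit FileNotFoundException on a stale path.
     * Returns true only if the replica was actually removed.
     */
    synchronized boolean invalidateMissingBlock(long blockId) {
        Replica current = volumeMap.get(blockId);
        if (current == null) {
            return false; // nothing to invalidate
        }
        // Double check: if the files exist at the replica's current
        // location, the reader merely raced with a move. Keep the replica.
        if (Files.exists(current.dataFile) && Files.exists(current.metaFile)) {
            return false;
        }
        volumeMap.remove(blockId);
        return true;
    }
}
```

In this sketch a reader that lost the race with finalization gets `false`
back and can retry against the updated location, while a replica whose files
are really gone is still removed.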



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
