[ https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809215#comment-17809215 ]
ASF GitHub Bot commented on HDFS-17342:
---------------------------------------

ZanderXu commented on PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#issuecomment-1902959898

   > This is a bug fix after https://github.com/apache/hadoop/pull/5564 , do you have time to help review this?

   @smarthanwang I have a question about HDFS-16985. Normally a FileNotFoundException means that the meta file or data file may be lost, so the replica on this DataNode may be corrupt, right?

   In your business situation (AWS EC2 + EBS), you don't expect the DataNode to delete this replica directly, so HDFS-16985 only removes the replica from the DN's memory. But I think the DN should delete the corrupt replica directly if it can be sure that the replica is corrupt, for example when the meta file or data file is lost.

   So we can add a configuration option to control whether the DN deletes the replica from disk directly, such as `fs.datanode.delete.corrupt.replica.from.disk`, with a default value of true.

   If `fs.datanode.delete.corrupt.replica.from.disk` is true, the DN deletes the corrupt replica from disk directly.
   If `fs.datanode.delete.corrupt.replica.from.disk` is false, the DN only removes the corrupt replica from memory.

   @smarthanwang @zhangshuyan0 looking forward to your good ideas.

> Fix DataNode may invalidates normal block causing missing block
> ---------------------------------------------------------------
>
>                 Key: HDFS-17342
>                 URL: https://issues.apache.org/jira/browse/HDFS-17342
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>
> When users read an appended file, occasional exceptions may occur, such as
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: xxx.
> This can happen if one thread is reading the block while the writer thread is
> finalizing it simultaneously.
> *Root cause:*
> # The reader thread obtains an RBW replica from the VolumeMap, such as
> blk_xxx_xxx[RBW], and expects the data file to be at /XXX/rbw/blk_xxx.
> # Simultaneously, the writer thread finalizes this block, moving it from
> the RBW directory to the FINALIZE directory. The data file is moved from
> /XXX/rbw/blk_xxx to /XXX/finalize/blk_xxx.
> # The reader thread attempts to open the data input stream but encounters a
> FileNotFoundException, because the data file /XXX/rbw/blk_xxx or the meta file
> /XXX/rbw/blk_xxx_xxx no longer exists at that moment.
> # The reader thread treats this block as corrupt and removes the replica
> from the volume map, and the DataNode reports the deleted block to the
> NameNode.
> # The NameNode removes this replica for the block.
> # If the file's current replication is 1, the file will have a missing block
> until this DataNode runs the DirectoryScanner again.
> As described above, the FileNotFoundException encountered by the reader
> thread is expected in this case, because the file has been moved.
> So we need to add a double check to the invalidateMissingBlock logic that
> verifies whether the data file or meta file exists, to avoid similar cases.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
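
The double check described in the root-cause analysis above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from PR #6464: the class `ReplicaDoubleCheck`, the `Action` enum, and both `decide` methods are hypothetical names, and the `deleteFromDisk` flag merely stands in for the proposed (not existing) `fs.datanode.delete.corrupt.replica.from.disk` option. It also assumes that the paths passed in are re-resolved from the replica's current state, so that a block that was just moved out of the RBW directory is looked up at its new location.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch of the double check; this is not an HDFS class. */
final class ReplicaDoubleCheck {

    /** What to do with a replica after a reader hit FileNotFoundException. */
    enum Action { KEEP_REPLICA, REMOVE_FROM_MEMORY, DELETE_FROM_DISK }

    /**
     * Pure decision logic. The replica is only considered corrupt if the
     * data file or the meta file is really missing at its current location.
     * deleteFromDisk models the proposed (assumed) configuration switch.
     */
    static Action decide(boolean dataExists, boolean metaExists, boolean deleteFromDisk) {
        if (dataExists && metaExists) {
            // Both files are present: the earlier FileNotFoundException was
            // most likely caused by a concurrent finalize, so keep the replica.
            return Action.KEEP_REPLICA;
        }
        return deleteFromDisk ? Action.DELETE_FROM_DISK : Action.REMOVE_FROM_MEMORY;
    }

    /** Same decision, but checks the files on the local file system. */
    static Action decide(Path dataFile, Path metaFile, boolean deleteFromDisk) {
        return decide(Files.exists(dataFile), Files.exists(metaFile), deleteFromDisk);
    }

    public static void main(String[] args) {
        // Placeholder paths copied from the issue description.
        Path data = Paths.get("/XXX/finalize/blk_xxx");
        Path meta = Paths.get("/XXX/finalize/blk_xxx_xxx");
        System.out.println(decide(data, meta, false));
    }
}
```

With a check like this, the replica would only be reported as deleted to the NameNode when the result is not `KEEP_REPLICA`, which is the case the issue wants to avoid for a block that was merely finalized.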