[ https://issues.apache.org/jira/browse/HDFS-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547214#comment-15547214 ]
Manoj Govindassamy commented on HDFS-10819:
-------------------------------------------

[~andrew.wang],
{quote}
Also curious, would invalidation eventually fix this case, or is it truly stuck?
{quote}
* In this test case it is truly stuck, since there are only 3 DNs and the expected replication factor is also 3.
* Block invalidation was not going through, so the replication factor never caught up. Invalidation at the DataNode failed because the volume that held the block had already been closed when the volume was removed.

{noformat}
730 2016-10-04 15:52:30,709 WARN impl.FsDatasetImpl (FsDatasetImpl.java:invalidate(1990)) - Volume /Users/manoj/work/cdh-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current is closed, ignore the deletion task for block ReplicaBeingWritten, blk_1073741825_1001, RBW
731   getNumBytes()     = 512
732   getBytesOnDisk()  = 512
733   getVisibleLength()= 512
734   getVolume()       = /Users/manoj/work/cdh-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current
735   getBlockFile()    = /Users/manoj/work/cdh-hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/current/BP-473099417-172.16.3.66-1475621545787/current/rbw/blk_1073741825
736   bytesAcked=512
737   bytesOnDisk=512
{noformat}

The core fix here is letting {{BlockManager#addStoredBlockUnderConstruction}} invoke {{addStoredBlock}} for all FINALIZED replicas, and letting {{addStoredBlock}} decide on the follow-up actions (which it already does), such as invalidations and removal of corrupt replicas.

[~andrew.wang], [~eddyxu], I would like to hear your further thoughts on this.
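To illustrate the intent of the fix, here is a minimal standalone sketch. The class and method names echo {{BlockManager}}, {{addStoredBlockUnderConstruction}} and {{addStoredBlock}}, but all types below are simplified stand-ins for illustration, not the real HDFS API:

```java
// Hypothetical, simplified model of the proposed fix. MiniBlockManager and
// Replica are stand-in types; only the delegation pattern mirrors the patch.
import java.util.HashSet;
import java.util.Set;

public class AddStoredBlockSketch {
    enum ReplicaState { FINALIZED, RBW, RWR }

    static class Replica {
        final String storageId;
        final ReplicaState state;
        Replica(String storageId, ReplicaState state) {
            this.storageId = storageId;
            this.state = state;
        }
    }

    // Stand-in for BlockManager state: which storages hold a good copy of
    // the block, and which are currently marked corrupt for it.
    static class MiniBlockManager {
        final Set<String> goodStorages = new HashSet<>();
        final Set<String> corruptStorages = new HashSet<>();

        // Before the fix, a storage that had earlier reported a corrupt copy
        // never had its later good (FINALIZED) replica stored. After the fix,
        // every FINALIZED replica is handed to addStoredBlock, which already
        // performs the follow-up actions (here: clearing the corrupt marking).
        void addStoredBlockUnderConstruction(Replica r) {
            if (r.state == ReplicaState.FINALIZED) {
                addStoredBlock(r);
            }
        }

        void addStoredBlock(Replica r) {
            goodStorages.add(r.storageId);
            // follow-up action: a good replica supersedes a stale corrupt one
            corruptStorages.remove(r.storageId);
        }
    }

    public static void main(String[] args) {
        MiniBlockManager bm = new MiniBlockManager();
        bm.corruptStorages.add("DS-1"); // DS-1 once reported a corrupt copy
        bm.addStoredBlockUnderConstruction(
            new Replica("DS-1", ReplicaState.FINALIZED));
        if (!bm.goodStorages.contains("DS-1")
                || bm.corruptStorages.contains("DS-1")) {
            throw new AssertionError("good replica should clear corrupt marking");
        }
        System.out.println("DS-1 counted as a live replica again");
    }
}
```

With this delegation in place, the storage that reported the corrupt block is counted as live again once it reports a finalized replica, so replication can catch up instead of staying stuck.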
> BlockManager fails to store a good block for a datanode storage after it
> reported a corrupt block — block replication stuck
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10819
>                 URL: https://issues.apache.org/jira/browse/HDFS-10819
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Manoj Govindassamy
>            Assignee: Manoj Govindassamy
>         Attachments: HDFS-10819.001.patch
>
>
> TestDataNodeHotSwapVolumes occasionally fails in the unit test
> testRemoveVolumeBeingWrittenForDatanode. A data write pipeline can have
> issues, as there could be timeouts, a data node not being reachable, etc.,
> and in this test case it was more of an induced one, as one of the volumes
> in a datanode is removed while a block write is in progress. Digging
> further in the logs, when the problem happens in the write pipeline, the
> error recovery is not happening as expected, leading to block replication
> never catching up.
> Though this problem has the same signature as in HDFS-10780, from the logs
> it looks like the code paths taken are totally different, and so the root
> cause could be different as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org