[ https://issues.apache.org/jira/browse/HDFS-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200783#comment-16200783 ]
Wei-Chiu Chuang edited comment on HDFS-11445 at 10/11/17 7:14 PM:
------------------------------------------------------------------

For future JIRA explorers: while backporting HDFS-11445 into CDH, our internal testing caught a regression in it. After tracing the code, I realized the regression is fixed by HDFS-11755.

Specifically, we found (bogus) missing-file warnings when running a Solr example application via Hue. What is interesting is that JMX shows MissingBlocks > 0, but there are no missing file names; the NameNode Web UI also warns about missing blocks, yet the fsck result is healthy.

Steps to reproduce:
1. Install a fresh CDH + CM cluster, 4 nodes (with HDFS-11445).
2. Go to the Hue UI and install the Solr example.
3. Restart CDH (all services).

In detail, the bug shows up after writing to a data pipeline. A DFSClient calls FSNamesystem#updatePipeline after it gets acks from the pipeline. However, if FSNamesystem#updatePipeline completes before the DataNodes report their IBRs, the block sees zero live replicas (the NameNode incorrectly treats all replicas as stale once the genstamp is updated).

I checked and compared the code in BlockManager#removeStoredBlock and BlockManager#addStoredBlock. In removeStoredBlock, in addition to removing a DN from a stored block, BlockManager also updates the under-replication queue (updateNeededReplications); but addStoredBlock does not update the under-replication queue after adding a DN to a stored block if the file is under construction. The fix in HDFS-11755 adds a check in BlockManager#removeStoredBlock to skip updating the under-replication queue when the file is under construction, which fixes the problem.

Since HDFS-11755 was committed to branch-2.8 through trunk before HDFS-11445, this bug is not seen in those branches. But we should backport HDFS-11755 to branch-2.7 to address the regression.
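To make the shape of that change concrete, the guard described above looks roughly like the fragment below. This is only an illustrative sketch against a branch-2.7-era BlockManager, not the literal HDFS-11755 diff; the exact condition and the helpers it leans on (getBlockCollection, isUnderConstruction, isPopulatingReplQueues) may differ between branches.

{code:java}
// Illustrative sketch only -- not the literal HDFS-11755 patch.
// Inside BlockManager#removeStoredBlock, after 'node' has been removed from
// the replica list of 'storedBlock':
BlockCollection bc = getBlockCollection(storedBlock);
if (bc != null && !bc.isUnderConstruction() && namesystem.isPopulatingReplQueues()) {
  // Only completed files are re-queued for replication here. Skipping files
  // that are still under construction mirrors addStoredBlock, so a pipeline
  // update (new genstamp) no longer makes the block look like it has zero
  // live replicas before the DataNodes' IBRs arrive.
  updateNeededReplications(storedBlock, -1, 0);
}
{code}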
> FSCK shows overall health status as corrupt even one replica is corrupt
> -----------------------------------------------------------------------
>
>                 Key: HDFS-11445
>                 URL: https://issues.apache.org/jira/browse/HDFS-11445
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>            Priority: Critical
>             Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
>
>         Attachments: HDFS-11445-002.patch, HDFS-11445-003.patch, HDFS-11445-004.patch, HDFS-11445-005.patch, HDFS-11445-branch-2.7-002.patch, HDFS-11445-branch-2.7.patch, HDFS-11445-branch-2.patch, HDFS-11445.patch
>
> In the following scenario, FSCK shows the overall health status as corrupt even though the file still has one good replica:
> 1. Create a file with replication factor 2.
> 2. Shut down one DN.
> 3. Append to the file again.
> 4. Restart the DN.
> 5. After the block report, check fsck.
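For anyone who wants to replay the quoted scenario outside a full cluster, the five steps map fairly directly onto a MiniDFSCluster run. The sketch below is only an illustration of those steps with made-up names (class, path, data sizes); it is not the test from the attached patches, and the helper calls are the standard Hadoop test utilities as I remember them.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class FsckOneGoodReplicaRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(2).build();
    try {
      cluster.waitActive();
      DistributedFileSystem fs = cluster.getFileSystem();
      Path file = new Path("/fsckCorruptRepro");
      // 1. Create a file with replication factor 2.
      DFSTestUtil.createFile(fs, file, 1024L, (short) 2, 0L);
      // 2. Shut down one DataNode.
      MiniDFSCluster.DataNodeProperties dnProps = cluster.stopDataNode(0);
      // 3. Append to the file again; the stopped DN now holds a stale replica
      //    with an older genstamp.
      DFSTestUtil.appendFile(fs, file, "appended while one DN was down");
      // 4. Restart the DataNode.
      cluster.restartDataNode(dnProps, true);
      // 5. After the block reports come in, run fsck (e.g. "hdfs fsck /fsckCorruptRepro")
      //    and check the overall health status; with one good replica it should be HEALTHY.
      cluster.triggerBlockReports();
    } finally {
      cluster.shutdown();
    }
  }
}
{code}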