[
https://issues.apache.org/jira/browse/HDFS-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200783#comment-16200783
]
Wei-Chiu Chuang commented on HDFS-11445:
----------------------------------------
For future JIRA explorers:
while backporting HDFS-11445 into CDH, our internal testing caught a regression
in it. After tracing the code, I realized the regression is fixed via
HDFS-11755.
Specifically, we found (bogus) missing file warnings when running a Solr
example application via Hue.
What's interesting is that JMX shows MissingBlocks > 0, but lists no missing
file names; the NameNode Web UI also shows warnings for missing blocks. Yet
fsck reports the filesystem as healthy.
Steps to reproduce:
1. Install a fresh CDH + CM cluster, 4 nodes (with HDFS-11445).
2. Go to Hue UI, install Solr example.
3. Restart CDH (all services)
In detail, this bug seems to happen after writing to a data pipeline. A
DFSClient calls FSNamesystem#updatePipeline after it gets acks from the
pipeline. However, if FSNamesystem#updatePipeline is called before the
DataNodes report IBRs (incremental block reports), the block would see zero
live replicas (it incorrectly thinks all replicas are stale after the genstamp
is updated).
I checked and compared the code in BlockManager#removeStoredBlock and
BlockManager#addStoredBlock.
In removeStoredBlock, in addition to removing a DN from a stored block,
BlockManager also updates the under-replication queue
(updateNeededReplications); but in addStoredBlock, it does not update the
under-replication queue after adding a DN to a stored block if the file is
under construction.
The fix in HDFS-11755 adds an additional check in
BlockManager#removeStoredBlock to skip updating the under-replication queue
when the file is under construction, which fixes the problem.
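
The asymmetry above can be sketched with a toy model. This is not the real
BlockManager code: ToyBlock, ToyBlockManager, missingQueue and simulate are
hypothetical names made up for illustration, and the model assumes that stale
replicas are dropped through the remove path once the genstamp is bumped. It
only shows why a queue update on remove, with no matching update on add, can
leave a block flagged as missing while the file is under construction.

```java
import java.util.HashSet;
import java.util.Set;

public class UnderConstructionQueueSketch {
    /** Toy stand-in for a stored block (hypothetical, not BlockInfo). */
    static final class ToyBlock {
        final Set<String> liveReplicas = new HashSet<>();
        boolean underConstruction = true;
    }

    /** Toy stand-in for the bookkeeping described in the comment. */
    static final class ToyBlockManager {
        final Set<ToyBlock> missingQueue = new HashSet<>();
        // true models the extra check described for HDFS-11755
        final boolean skipQueueUpdateWhenUnderConstruction;

        ToyBlockManager(boolean skipQueueUpdateWhenUnderConstruction) {
            this.skipQueueUpdateWhenUnderConstruction =
                skipQueueUpdateWhenUnderConstruction;
        }

        void removeStoredBlock(ToyBlock b, String dn) {
            b.liveReplicas.remove(dn);
            if (skipQueueUpdateWhenUnderConstruction && b.underConstruction) {
                return; // guard: leave the queue alone while under construction
            }
            updateNeededReplications(b);
        }

        void addStoredBlock(ToyBlock b, String dn) {
            b.liveReplicas.add(dn);
            if (b.underConstruction) {
                return; // no queue update on add while under construction
            }
            updateNeededReplications(b);
        }

        void updateNeededReplications(ToyBlock b) {
            if (b.liveReplicas.isEmpty()) {
                missingQueue.add(b);
            } else {
                missingQueue.remove(b);
            }
        }
    }

    /** Returns how many blocks are left flagged as missing. */
    static int simulate(boolean withGuard) {
        ToyBlockManager bm = new ToyBlockManager(withGuard);
        ToyBlock b = new ToyBlock();
        bm.addStoredBlock(b, "dn1");
        bm.addStoredBlock(b, "dn2");
        // Assumption: the genstamp bump makes both replicas look stale,
        // and they are dropped before any IBR arrives.
        bm.removeStoredBlock(b, "dn1");
        bm.removeStoredBlock(b, "dn2");
        // IBRs arrive afterwards and re-add the replicas.
        bm.addStoredBlock(b, "dn1");
        bm.addStoredBlock(b, "dn2");
        return bm.missingQueue.size();
    }

    public static void main(String[] args) {
        System.out.println("without guard: " + simulate(false));
        System.out.println("with guard:    " + simulate(true));
    }
}
```

In this model the unguarded run leaves the block in the missing queue even
though both replicas came back, which matches the symptom of MissingBlocks > 0
with a healthy fsck.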
Since HDFS-11755 was committed to branch 2.8 through trunk before HDFS-11445,
this bug is not seen in those branches. But we should backport HDFS-11755 to
branch 2.7 to address the regression.
> FSCK shows overall health status as corrupt even one replica is corrupt
> -----------------------------------------------------------------------
>
> Key: HDFS-11445
> URL: https://issues.apache.org/jira/browse/HDFS-11445
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Brahma Reddy Battula
> Assignee: Brahma Reddy Battula
> Priority: Critical
> Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
>
> Attachments: HDFS-11445-002.patch, HDFS-11445-003.patch,
> HDFS-11445-004.patch, HDFS-11445-005.patch, HDFS-11445-branch-2.7-002.patch,
> HDFS-11445-branch-2.7.patch, HDFS-11445-branch-2.patch, HDFS-11445.patch
>
>
> In the following scenario, FSCK shows the overall health status as corrupt
> even though the file has one good replica.
> 1. Create file with 2 RF.
> 2. Shutdown one DN
> 3. Append to file again.
> 4. Restart the DN
> 5. After block report, check Fsck
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)