[ https://issues.apache.org/jira/browse/HDFS-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200783#comment-16200783 ]
Wei-Chiu Chuang edited comment on HDFS-11445 at 10/11/17 7:14 PM:
------------------------------------------------------------------

For future JIRA explorers: while backporting HDFS-11445 into CDH, our internal testing caught a regression in it. After tracing the code, I realized the regression is fixed by HDFS-11755.

Specifically, we found (bogus) missing-file warnings when running a Solr example application via Hue. What is interesting is that JMX shows MissingBlocks > 0, but there are no missing file names; the NameNode Web UI also warns about missing blocks, yet the fsck result is healthy.

Steps to reproduce:
1. Install a fresh CDH + CM cluster, 4 nodes (with HDFS-11445).
2. Go to the Hue UI and install the Solr example.
3. Restart CDH (all services).

In detail, the bug shows up after writing to a data pipeline. A DFSClient calls FSNamesystem#updatePipeline after it gets acks from the pipeline. However, if FSNamesystem#updatePipeline completes before the DataNodes report their IBRs, the block sees zero live replicas (the NameNode incorrectly treats all replicas as stale once the genstamp is updated).

I checked and compared the code in BlockManager#removeStoredBlock and BlockManager#addStoredBlock. In removeStoredBlock, in addition to removing a DN from a stored block, BlockManager also updates the under-replication queue (updateNeededReplications); but addStoredBlock does not update the under-replication queue after adding a DN to a stored block if the file is under construction. The fix in HDFS-11755 adds a check in BlockManager#removeStoredBlock to skip updating the under-replication queue when the file is under construction, which fixes the problem.

Since HDFS-11755 was committed to branch-2.8 through trunk before HDFS-11445, this bug is not seen in those branches. But we should backport HDFS-11755 to branch-2.7 to address the regression.
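To make the shape of that change concrete, the guard described above looks roughly like the fragment below. This is only an illustrative sketch against a branch-2.7-era BlockManager, not the literal HDFS-11755 diff; the exact condition and the helpers it leans on (getBlockCollection, isUnderConstruction, isPopulatingReplQueues) may differ between branches.

{code:java}
// Illustrative sketch only -- not the literal HDFS-11755 patch.
// Inside BlockManager#removeStoredBlock, after 'node' has been removed from
// the replica list of 'storedBlock':
BlockCollection bc = getBlockCollection(storedBlock);
if (bc != null && !bc.isUnderConstruction() && namesystem.isPopulatingReplQueues()) {
  // Only completed files are re-queued for replication here. Skipping files
  // that are still under construction mirrors addStoredBlock, so a pipeline
  // update (new genstamp) no longer makes the block look like it has zero
  // live replicas before the DataNodes' IBRs arrive.
  updateNeededReplications(storedBlock, -1, 0);
}
{code}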
> FSCK shows overall health status as corrupt even one replica is corrupt
> -----------------------------------------------------------------------
>
>                 Key: HDFS-11445
>                 URL: https://issues.apache.org/jira/browse/HDFS-11445
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>            Priority: Critical
>             Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
>
>         Attachments: HDFS-11445-002.patch, HDFS-11445-003.patch, HDFS-11445-004.patch, HDFS-11445-005.patch, HDFS-11445-branch-2.7-002.patch, HDFS-11445-branch-2.7.patch, HDFS-11445-branch-2.patch, HDFS-11445.patch
>
> In the following scenario, FSCK shows the overall health status as corrupt even though the file still has one good replica:
> 1. Create a file with replication factor 2.
> 2. Shut down one DN.
> 3. Append to the file again.
> 4. Restart the DN.
> 5. After the block report, check fsck.
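For anyone who wants to replay the quoted scenario outside a full cluster, the five steps map fairly directly onto a MiniDFSCluster run. The sketch below is only an illustration of those steps with made-up names (class, path, data sizes); it is not the test from the attached patches, and the helper calls are the standard Hadoop test utilities as I remember them.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class FsckOneGoodReplicaRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(2).build();
    try {
      cluster.waitActive();
      DistributedFileSystem fs = cluster.getFileSystem();
      Path file = new Path("/fsckCorruptRepro");
      // 1. Create a file with replication factor 2.
      DFSTestUtil.createFile(fs, file, 1024L, (short) 2, 0L);
      // 2. Shut down one DataNode.
      MiniDFSCluster.DataNodeProperties dnProps = cluster.stopDataNode(0);
      // 3. Append to the file again; the stopped DN now holds a stale replica
      //    with an older genstamp.
      DFSTestUtil.appendFile(fs, file, "appended while one DN was down");
      // 4. Restart the DataNode.
      cluster.restartDataNode(dnProps, true);
      // 5. After the block reports come in, run fsck (e.g. "hdfs fsck /fsckCorruptRepro")
      //    and check the overall health status; with one good replica it should be HEALTHY.
      cluster.triggerBlockReports();
    } finally {
      cluster.shutdown();
    }
  }
}
{code}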