[ 
https://issues.apache.org/jira/browse/HDFS-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378278#comment-15378278
 ] 

Daryn Sharp commented on HDFS-10627:
------------------------------------

I agree it's unfortunate there's no feedback mechanism.  I disagree that the 
serving node should ever assume that losing a client means the block _might_ 
be corrupt.  There are many reasons a client can "unexpectedly" close the 
connection: processes shut down unexpectedly, get killed, etc.

The only time the DN should suspect its own block is on a local IOE.  I 
checked some of our big clusters: DNs self-reporting a corrupt block during 
pipeline/block recovery account for a fraction of a percent of cases (and 
that's being generous).

So let's consider...  Is it worth backlogging a DN with so many false positives 
from broken connections that:
# It takes a day to scan a legitimately bad block detected by a local IOE
# A rack failure in that window can cause temporary data loss
# The scanner isn't doing its primary job of trawling the storage for bad blocks
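A minimal sketch of the guard being argued for here: treat client-side disconnects as benign and only mark the block suspect for other IOEs. This is an illustration, not the actual patch; aside from the "Broken pipe"/"Connection reset" prefixes checked in BlockSender, the class and method names below are hypothetical.

{code:title=SuspectBlockGuard.java (sketch)|borderStyle=solid}
import java.io.IOException;

public class SuspectBlockGuard {
  // Heuristic: a client dropping the connection is not evidence that the
  // replica is bad. The message prefixes mirror the existing checks in
  // BlockSender.sendChunks().
  static boolean isClientDisconnect(IOException e) {
    String m = e.getMessage();
    return m != null
        && (m.startsWith("Broken pipe") || m.startsWith("Connection reset"));
  }

  // Only feed the scanner's suspect queue when the failure is local to the
  // DN, i.e. not a client disconnect.
  static boolean shouldMarkSuspect(IOException e) {
    return !isClientDisconnect(e);
  }
}
{code}

With a guard like this, the call to markSuspectBlock() would be skipped for "Broken pipe" and "Connection reset by peer" failures, keeping the scanner's queue reserved for blocks that failed with a genuine local IOE.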

> Volume Scanner marks a block as "suspect" even if the block sender encounters 
> 'Broken pipe' or 'Connection reset by peer' exception
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10627
>                 URL: https://issues.apache.org/jira/browse/HDFS-10627
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.7.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>
> In the BlockSender code,
> {code:title=BlockSender.java|borderStyle=solid}
>         if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection reset")) {
>           LOG.error("BlockSender.sendChunks() exception: ", e);
>         }
>         datanode.getBlockScanner().markSuspectBlock(
>               volumeRef.getVolume().getStorageID(),
>               block);
> {code}
> Before HDFS-7686, the block was marked as suspect only if the exception 
> message didn't start with "Broken pipe" or "Connection reset".
> But after HDFS-7686, the block is marked as suspect irrespective of the 
> exception message.
> On one of our datanodes, it took approximately a whole day (22 hours) to work 
> through all the suspect blocks before it scanned one actually corrupt block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
