[ 
https://issues.apache.org/jira/browse/HDFS-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384523#comment-15384523
 ] 

Daryn Sharp commented on HDFS-10627:
------------------------------------

Adding a feedback mechanism would be very useful, but it should be a separate
jira.  I'm sure it's harder than it seems.  (I'm not sure why the packet
responder isn't started.  I think it may be skipped as an optimization in some
cases, and/or I suspect it has to do with recovery needing to copy a block that
appears to have tail corruption before truncating it.  Not my area of expertise...)

This jira, however, must restore the prior behavior so our clusters can actually
detect bad blocks.  Latent corruption is going undetected.  Real, detected
corruption sits in the queue for days, which increases the risk of data loss.  DNs are too
busy verifying false positives from clients that didn't fully read the stream.

> Volume Scanner mark a block as "suspect" even if the block sender encounters 
> 'Broken pipe' or 'Connection reset by peer' exception
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10627
>                 URL: https://issues.apache.org/jira/browse/HDFS-10627
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.7.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HDFS-10627.patch
>
>
> In the BlockSender code,
> {code:title=BlockSender.java|borderStyle=solid}
>         if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection reset")) {
>           LOG.error("BlockSender.sendChunks() exception: ", e);
>         }
>         datanode.getBlockScanner().markSuspectBlock(
>               volumeRef.getVolume().getStorageID(),
>               block);
> {code}
> Before HDFS-7686, the block was marked as suspect only if the exception
> message didn't start with "Broken pipe" or "Connection reset".
> After HDFS-7686, however, the block is marked as suspect regardless of the
> exception message.
> On one of our datanodes, it took roughly a whole day (22 hours) to work
> through all the suspect blocks before the one actually corrupt block was scanned.
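
The difference between the two behaviors comes down to whether the suspect-marking call sits inside the message check. Below is a minimal Java sketch of the pre-HDFS-7686 behavior under a simplified model. `SuspectFilterSketch`, `isClientDisconnect`, `handleSendError`, and the `suspectBlocks` list are hypothetical stand-ins, not Hadoop APIs. Treating an exception with a null message as a genuine error is also an assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch (not the actual Hadoop BlockSender code) of the
 * pre-HDFS-7686 behavior: an IOException whose message starts with
 * "Broken pipe" or "Connection reset" is treated as the reader going away,
 * so the block is neither logged as an error nor marked as suspect.
 */
public class SuspectFilterSketch {

    /** Hypothetical stand-in for the block scanner's suspect queue. */
    static final List<String> suspectBlocks = new ArrayList<>();

    /** Returns true if the exception message looks like a client disconnect. */
    static boolean isClientDisconnect(IOException e) {
        String msg = e.getMessage();
        if (msg == null) {
            return false; // no message: cannot rule out a local read error
        }
        return msg.startsWith("Broken pipe") || msg.startsWith("Connection reset");
    }

    /** Only errors that are not client disconnects are logged and queued. */
    static void handleSendError(IOException e, String blockId) {
        if (!isClientDisconnect(e)) {
            System.err.println("sendChunks() exception for " + blockId + ": " + e.getMessage());
            suspectBlocks.add(blockId); // assumed equivalent of markSuspectBlock(...)
        }
    }

    public static void main(String[] args) {
        handleSendError(new IOException("Broken pipe"), "blk_1");
        handleSendError(new IOException("Connection reset by peer"), "blk_2");
        handleSendError(new IOException("Input/output error"), "blk_3");
        System.out.println(suspectBlocks); // prints [blk_3]
    }
}
```

Only the third block reaches the suspect queue, so the scanner's time goes to blocks that failed for a local reason. Matching on message prefixes is fragile, because the text of socket errors may vary by platform and JDK. It is, however, the check the quoted code already uses.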



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
