[ 
https://issues.apache.org/jira/browse/HDFS-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378353#comment-15378353
 ] 

Daryn Sharp commented on HDFS-10627:
------------------------------------

Rushabh and I checked a few random healthy nodes on multiple clusters.
# They are backlogged with thousands of suspect blocks on all storages.  It 
will take days to catch up, assuming no more false positives.
# They have been up for months but haven't started a rescan in over a month 
(as far back as the non-archived logs go).  So obviously the false positives 
are trickling in faster than the scan rate.
# Only one node reported a bad block in the past month.  The others have not 
reported any bad blocks.
# On a large cluster, only 0.08% of the suspected pipeline recovery 
corruptions turned out to be real.

_The scanner is completely negligent in its duty to find and report bad 
blocks_.  Rushabh added the priority scan feature to prevent a rack failure 
from causing (hopefully) temporary data loss.  Instead of waiting up to a week 
for a suspected corrupt block to be reported, it would be reported almost 
immediately.  Well, guess what happened today?  A rack failed.  Data was lost.  
The DN knew the block was bad but was too backlogged to verify and report it.  
A completely avoidable situation, and not worth detecting 0.08% of pipeline 
recovery corruptions.

*This is completely broken and must be reverted to be fixed*.
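For illustration, the guard being argued for can be sketched as follows. This is a minimal standalone sketch, not the actual BlockSender code: it assumes (per the description quoted below) that marking a block suspect should share the same condition that already suppresses logging for ordinary client disconnects. The class and method names here are hypothetical.

```java
public class SuspectBlockGuard {

    // Hypothetical helper: true for IOException messages that merely mean
    // the remote client went away, which says nothing about block health.
    static boolean isClientDisconnect(String ioem) {
        return ioem.startsWith("Broken pipe")
            || ioem.startsWith("Connection reset");
    }

    // Only genuine read-side failures should feed the scanner's suspect
    // queue; disconnects would otherwise flood it with false positives.
    static boolean shouldMarkSuspect(String ioem) {
        return !isClientDisconnect(ioem);
    }

    public static void main(String[] args) {
        // Client disconnects: do not mark suspect.
        System.out.println(shouldMarkSuspect("Broken pipe"));             // false
        System.out.println(shouldMarkSuspect("Connection reset by peer")); // false
        // A real I/O error: still mark suspect and let the scanner verify.
        System.out.println(shouldMarkSuspect("Input/output error"));       // true
    }
}
```

Under this condition, a `datanode.getBlockScanner().markSuspectBlock(...)` call would only fire when `shouldMarkSuspect` returns true, so transient client disconnects would no longer backlog the volume scanner.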


> Volume Scanner mark a block as "suspect" even if the block sender encounters 
> 'Broken pipe' or 'Connection reset by peer' exception
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10627
>                 URL: https://issues.apache.org/jira/browse/HDFS-10627
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.7.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>
> In the BlockSender code,
> {code:title=BlockSender.java|borderStyle=solid}
>         if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection reset")) {
>           LOG.error("BlockSender.sendChunks() exception: ", e);
>         }
>         datanode.getBlockScanner().markSuspectBlock(
>             volumeRef.getVolume().getStorageID(),
>             block);
> {code}
> Before HDFS-7686, the block was marked as suspect only if the exception 
> message didn't start with "Broken pipe" or "Connection reset".
> But after HDFS-7686, the block is marked as suspect irrespective of the 
> exception message.
> On one of our datanodes, it took approximately a whole day (22 hours) to 
> work through all the suspect blocks in order to scan one corrupt block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
