[ https://issues.apache.org/jira/browse/HDFS-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378278#comment-15378278 ]
Daryn Sharp commented on HDFS-10627:
------------------------------------

I agree it's unfortunate there's no feedback mechanism. I disagree that the serving node should ever assume that losing a client means the block _might_ be corrupt. There are many reasons a client can "unexpectedly" close the connection: processes shut down unexpectedly, get killed, etc. The only time the DN should be suspicious of its own block is a local IOE. I checked some of our big clusters: a DN self-reporting a corrupt block during pipeline/block recovery is a fraction of a percent (which is being generous).

So let's consider... Is it worth backlogging a DN with so many false positives from broken connections that:
# It takes a day to scan a legitimately bad block detected by a local IOE
# A rack failure in that window could cause temporary data loss
# The scanner isn't doing its primary job of trawling the storage for bad blocks

> Volume Scanner marks a block as "suspect" even if the block sender encounters
> a 'Broken pipe' or 'Connection reset by peer' exception
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10627
>                 URL: https://issues.apache.org/jira/browse/HDFS-10627
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.7.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>
> In the BlockSender code,
> {code:title=BlockSender.java|borderStyle=solid}
> if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection reset")) {
>   LOG.error("BlockSender.sendChunks() exception: ", e);
> }
> datanode.getBlockScanner().markSuspectBlock(
>     volumeRef.getVolume().getStorageID(),
>     block);
> {code}
> Before HDFS-7686, the block was marked as suspect only if the exception
> message didn't start with "Broken pipe" or "Connection reset".
> But after HDFS-7686, the block is marked as suspect irrespective of the
> exception message.
> On one of our datanodes, it took approximately a whole day (22 hours) to go
> through all the suspect blocks before the one corrupt block was scanned.
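As a minimal, standalone illustration of the guard argued for above (only a local I/O error should make the scanner suspicious, not a client disconnect), here is a sketch of the check that would gate markSuspectBlock(). The class name SuspectBlockGuard and the helper isClientDisconnect() are illustrative assumptions, not the actual HDFS code or patch:

{code:title=SuspectBlockGuard.java (sketch)|borderStyle=solid}
import java.io.IOException;

// Standalone sketch, not the real BlockSender: treat a client-side disconnect
// as noise rather than evidence that the local replica is bad.
public class SuspectBlockGuard {

  // True when the IOException only tells us the remote client went away.
  static boolean isClientDisconnect(IOException e) {
    String ioem = e.getMessage();
    return ioem != null
        && (ioem.startsWith("Broken pipe") || ioem.startsWith("Connection reset"));
  }

  public static void main(String[] args) {
    IOException reset = new IOException("Connection reset by peer");
    IOException diskError = new IOException("Input/output error reading block file");

    // A dropped client should not queue the block for rescan...
    System.out.println("connection reset -> mark suspect? " + !isClientDisconnect(reset));
    // ...but a genuinely local I/O error should.
    System.out.println("local disk error -> mark suspect? " + !isClientDisconnect(diskError));
  }
}
{code}

In BlockSender.sendChunks(), a predicate like this would wrap the existing datanode.getBlockScanner().markSuspectBlock(...) call shown in the description, so broken-pipe and connection-reset failures no longer flood the volume scanner's suspect queue.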