[ https://issues.apache.org/jira/browse/HDFS-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384523#comment-15384523 ]
Daryn Sharp commented on HDFS-10627:
------------------------------------

Adding a feedback mechanism would be very useful, but it should be a different jira. I'm sure it's harder than it seems. (I'm not sure why the packet responder isn't started. I think it may sometimes be an optimization, and/or I suspect it has to do with recovery needing to copy what appears to be tail corruption before truncating it. Not my area of expertise...)

This jira, however, must restore the prior behavior so our clusters can actually detect bad blocks. Latent corruption is going undetected. Legit/detected corruption is queued for days, which increases the risk of data loss. DNs are too busy verifying false positives from clients that didn't fully read the stream.

> Volume Scanner mark a block as "suspect" even if the block sender encounters 'Broken pipe' or 'Connection reset by peer' exception
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10627
>                 URL: https://issues.apache.org/jira/browse/HDFS-10627
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.7.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HDFS-10627.patch
>
>
> In the BlockSender code,
> {code:title=BlockSender.java|borderStyle=solid}
>         if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection reset")) {
>           LOG.error("BlockSender.sendChunks() exception: ", e);
>         }
>         datanode.getBlockScanner().markSuspectBlock(
>               volumeRef.getVolume().getStorageID(),
>               block);
> {code}
> Before HDFS-7686, the block was marked as suspect only if the exception
> message did not start with "Broken pipe" or "Connection reset".
> But after HDFS-7686, the block is marked as suspect irrespective of the
> exception message.
> On one of our datanodes, it took approximately a whole day (22 hours) to go
> through all the suspect blocks before scanning one corrupt block.
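
For illustration only, here is a minimal, self-contained sketch of the prior behavior the description refers to: a block is queued as suspect only when the send failure does not look like a client disconnect. This is not the attached patch and not the real BlockSender code. The class SuspectBlockSketch, the methods isClientDisconnect and handleSendFailure, and the in-memory suspectBlocks list are hypothetical stand-ins for the datanode's block scanner, and treating a null exception message as "not a disconnect" is an assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SuspectBlockSketch {
    // Hypothetical stand-in for the block scanner's suspect queue.
    static final List<String> suspectBlocks = new ArrayList<>();

    // True when the message suggests the client went away mid-read,
    // which says nothing about the health of the on-disk block.
    static boolean isClientDisconnect(IOException e) {
        String ioem = e.getMessage();
        // Assumption: a null message is not treated as a disconnect.
        return ioem != null
            && (ioem.startsWith("Broken pipe")
                || ioem.startsWith("Connection reset"));
    }

    // Models the pre-HDFS-7686 behavior described above: log and mark
    // the block as suspect only for errors other than client disconnects.
    static void handleSendFailure(String blockId, IOException e) {
        if (!isClientDisconnect(e)) {
            System.err.println("BlockSender.sendChunks() exception: " + e);
            suspectBlocks.add(blockId);
        }
    }

    public static void main(String[] args) {
        handleSendFailure("blk_1", new IOException("Broken pipe"));
        handleSendFailure("blk_2", new IOException("Connection reset by peer"));
        handleSendFailure("blk_3", new IOException("Input/output error"));
        System.out.println(suspectBlocks); // prints [blk_3]
    }
}
```

With the marking call inside the guard, a client that stops reading early no longer adds work to the scanner's suspect queue, so genuinely suspect blocks are not stuck behind false positives.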