[ https://issues.apache.org/jira/browse/HADOOP-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531871 ]
dhruba borthakur commented on HADOOP-1955:
------------------------------------------

+1. Code looks good. It would be nice to have a unit test for this one. There should be a separate JIRA that allows detection & deletion of corrupted replicas. Can you please file that one (if it does not already exist) and link it to this one? Thanks.

> Corrupted block replication retries for ever
> --------------------------------------------
>
>                 Key: HADOOP-1955
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1955
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.1
>            Reporter: Koji Noguchi
>            Assignee: Raghu Angadi
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1955.patch
>
>
> When replicating a corrupted block, the receiving side rejects the block due to a
> checksum error. The namenode keeps retrying (with the same source datanode).
> Fsck shows those blocks as under-replicated.
>
> [Namenode log]
> {noformat}
> 2007-09-27 02:00:05,273 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 99.2.99.111
> ...
> 2007-09-27 02:01:02,618 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.37:9999
> 2007-09-27 02:10:03,843 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_-5925066143536023890
> 2007-09-27 02:10:08,248 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.35:9999
> 2007-09-27 02:20:03,848 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_-5925066143536023890
> 2007-09-27 02:20:08,646 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.19:9999
> (repeats)
> {noformat}
>
> [Datanode (sender) 99.9.99.11 log]
> {noformat}
> 2007-09-27 02:01:04,493 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_-5925066143536023890 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@e58187
> 2007-09-27 02:01:05,153 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-5925066143536023890 to 74.6.128.37:50010 got
> java.net.SocketException: Connection reset
>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
>     at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
>     at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
>     at java.lang.Thread.run(Thread.java:619)
> (repeats)
> {noformat}
>
> [Datanode (one of the receivers) 99.9.99.37 log]
> {noformat}
> 2007-09-27 02:01:05,150 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver:
> java.io.IOException: Unexpected checksum mismatch while writing blk_-5925066143536023890 from /74.6.128.33:57605
>     at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:902)
>     at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:727)
>     at java.lang.Thread.run(Thread.java:619)
> {noformat}

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
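To make the failure mode in the logs concrete, here is a small toy model. It is NOT the Hadoop 0.14 code and none of the class or method names below exist in HDFS; it is a minimal sketch, assuming a CRC32 per-block checksum, of (a) why retrying the same corrupted source can never succeed, and (b) the direction the comment above suggests: detect the corrupt replica after a checksum rejection and switch to a different source instead of retrying blindly.

```java
import java.util.zip.CRC32;

// Toy illustration of HADOOP-1955 -- invented names, not Hadoop code.
public class ReplicationRetryDemo {
    // A "replica" is block data plus the checksum recorded when it was written.
    static class Replica {
        final byte[] data;
        final long storedChecksum;
        Replica(byte[] data, long storedChecksum) {
            this.data = data;
            this.storedChecksum = storedChecksum;
        }
    }

    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    // Receiver-side verification: reject the transfer when the bytes do not
    // match the checksum sent with them (the receiver's
    // "Unexpected checksum mismatch" IOException in the log above).
    static boolean receive(Replica r) {
        return crc(r.data) == r.storedChecksum;
    }

    public static void main(String[] args) {
        byte[] good = "block contents".getBytes();
        // On-disk bytes flipped after the checksum was recorded.
        Replica corrupted = new Replica("block cOntents".getBytes(), crc(good));

        // Buggy behaviour: the namenode keeps asking the SAME source,
        // so every attempt fails identically. A cap makes the demo terminate;
        // in the reported bug there is effectively no cap.
        int attempts = 0;
        boolean replicated = false;
        while (!replicated && attempts < 5) {
            attempts++;
            replicated = receive(corrupted);
        }
        System.out.println("attempts=" + attempts + " replicated=" + replicated);

        // Suggested direction: on a checksum rejection, mark that replica
        // corrupt, drop it as a replication source, and pick a healthy one.
        Replica healthy = new Replica(good, crc(good));
        Replica source = corrupted;
        if (!receive(source)) {
            source = healthy; // "detection & deletion of corrupted replicas"
        }
        System.out.println("after switching source: replicated=" + receive(source));
    }
}
```

Running this prints `attempts=5 replicated=false` for the retry loop and `after switching source: replicated=true` once the corrupt replica is excluded, which is the behavioural change the follow-up JIRA would enable.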