[ https://issues.apache.org/jira/browse/HDFS-10714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542658#comment-15542658 ]
Kihwal Lee commented on HDFS-10714:
-----------------------------------

+1 on the DN remembering what it did during a recovery and being more adaptive. Unconditionally removing DN3 first might not be a good idea, since it is the only node that did checksum verification. The data on this node up to the ACKed bytes is very likely good (it still could have bad data on disk). In the majority of cases I have analyzed in the past, removing it would hurt more than help. Sure, it might be at fault, but it seems too harsh to remove it first.

Perhaps instead of statically removing one node, the DN should perform further diagnostics. The client could try different node orderings in the pipeline before removing any node. We could also add a feature to tell all DNs in the pipeline to do checksum verification in the middle of a block write (is a per-packet switch possible?). If the errors from these propagate properly to the client, it will be able to make a more informed decision and avoid blaming the wrong node.

Of course, this won't be perfect either. We also see checksum problems during a dfs write stemming from faulty clients; clients hitting OOM are the most common case, and these are irrecoverable.

While we are on the subject of write pipelines, the transferBlock op during replication is even worse, since the ACK is practically turned off. A node with a faulty NIC can do some damage there. But that's outside the scope of this jira.

> Issue in handling checksum errors in write pipeline when faulty DN is
> LAST_IN_PIPELINE
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-10714
>                 URL: https://issues.apache.org/jira/browse/HDFS-10714
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>       Attachments: HDFS-10714-01-draft.patch
>
>
> We came across an issue where a write failed even though 7 DNs were available,
> due to a network fault at one datanode which is LAST_IN_PIPELINE. It is
> similar to HDFS-6937.
> Scenario (DN3 has a N/W fault and min replication = 2):
> Write pipeline:
> DN1->DN2->DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN2 is marked as bad.
> DN1->DN4->DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN4 is marked as bad.
> ...
> And so on (each time DN3 is LAST_IN_PIPELINE), continuing until there are no
> more datanodes left to construct the pipeline.

-- 
This message was sent by Atlassian JIRA
(v6.3.4#6332)
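[Editor's note] The adaptive-blame idea in the comment above can be sketched in a few lines. This is not HDFS code: the class and method names are hypothetical, and it only illustrates the shape of the proposal, i.e. the client remembers which datanodes were present in each failed pipeline and excludes a node only after it has been implicated repeatedly, rather than always blaming the node upstream of the ERROR_CHECKSUM ack. (A real implementation would also weight where in the pipeline the checksum error was reported; with bare counting, DN1 in the scenario above accumulates suspicion too.)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of adaptive blame tracking across pipeline recoveries.
public class PipelineBlameTracker {
    // How many failed pipelines a DN must appear in before being excluded.
    private final int threshold;
    private final Map<String, Integer> suspectCounts = new HashMap<>();

    public PipelineBlameTracker(int threshold) {
        this.threshold = threshold;
    }

    // Record a failed pipeline attempt: every member is a suspect.
    public void recordFailure(List<String> pipeline) {
        for (String dn : pipeline) {
            suspectCounts.merge(dn, 1, Integer::sum);
        }
    }

    // Exclude a node only once it has been implicated repeatedly.
    public boolean shouldExclude(String dn) {
        return suspectCounts.getOrDefault(dn, 0) >= threshold;
    }
}
```

In the scenario from the issue description, after the DN1->DN2->DN3 and DN1->DN4->DN3 failures, DN3 has been implicated twice while DN2 and DN4 have each been implicated once, so DN3 becomes a candidate for exclusion instead of the innocent middle nodes.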