[ https://issues.apache.org/jira/browse/HDFS-10714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543529#comment-15543529 ]
Yongjun Zhang commented on HDFS-10714:
--------------------------------------

Hi [~vinayrpet],

Thanks for your work here, and sorry for the late reply. I agree with Kihwal's comment that "Guessing who is faulty is complicated". The idea of the patch described at https://issues.apache.org/jira/browse/HDFS-10714?focusedCommentId=15467559&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15467559 seems to be a reasonable approach, but I have questions about the following scenarios.

Initial pipeline: DN1 -> DN2 -> DN3

Scenario 1. Detecting a possible network issue with DN2's output
1.1 DN3 reports a checksum error
1.2 DN2 checks itself and sees its data is good
1.3 DN3 is treated as bad and replaced with DN4
1.4 New pipeline: DN1 -> DN2 -> DN4
1.5 DN4 reports a checksum error
1.6 DN2 is treated as bad

Question 1: do we allow DN3 and DN4 to be added back to the set of available DNs for later recovery? In theory we should.

Scenario 2. Detecting data corruption at DN2 (this is like what's reported in HDFS-6937)
2.1 DN3 reports a checksum error
2.2 DN2 checks itself, sees its data is bad, and reports a checksum error
2.3 DN1 checks itself and sees its data is good
2.4 DN2 is treated as bad

Question 2: is this how it works? Do we add DN3 back to the available DNs for later recovery?

Scenario 3.
3.1 DN3 reports a checksum error
3.2 DN2 checks itself and sees its data is bad
3.3 DN1 checks itself and sees its data is bad
3.4 DN1 is treated as bad

Question 3: is this how it works? And do we have DN2 and DN3 available for use by later recovery?

(For illustration, a rough sketch of the blame-assignment logic these scenarios assume is appended after the issue summary below.)

Thanks.


> Issue in handling checksum errors in write pipeline when fault DN is LAST_IN_PIPELINE
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-10714
>                 URL: https://issues.apache.org/jira/browse/HDFS-10714
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>         Attachments: HDFS-10714-01-draft.patch
>
>
> We came across an issue where a write failed even though 7 DNs were available, due to a network fault at one datanode which is LAST_IN_PIPELINE. It is similar to HDFS-6937.
> Scenario (DN3 has a network fault and min replication = 2):
> Write pipeline:
> DN1 -> DN2 -> DN3  =>  DN3 gives an ERROR_CHECKSUM ack, so DN2 is marked as bad
> DN1 -> DN4 -> DN3  =>  DN3 gives an ERROR_CHECKSUM ack, so DN4 is marked as bad
> ...
> And so on (every time, DN3 is LAST_IN_PIPELINE), continuing until there are no more datanodes left to construct the pipeline.
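To make the blame assignment in the three scenarios above concrete, here is a rough, self-contained sketch in plain Java. It is not the attached patch or actual HDFS code; the class, enum, and method names (PipelineFaultGuess, SelfCheck, guessFaultyNode) are made up for illustration. Given the self-check results of the nodes upstream of the reporter, after the last node in the pipeline acks ERROR_CHECKSUM, it picks the node to treat as bad.

{code:java}
import java.util.List;

// Hypothetical sketch, not the HDFS-10714 patch: decide which pipeline node
// to treat as bad, following the blame assignment in scenarios 1-3 above.
public class PipelineFaultGuess {

    /** Result of a pipeline node checking its own on-disk replica. */
    enum SelfCheck { GOOD, BAD }

    /**
     * pipeline:       nodes in upstream-to-downstream order, e.g. [DN1, DN2, DN3];
     *                 the last node is the one that reported ERROR_CHECKSUM.
     * upstreamChecks: self-check results for the nodes upstream of the reporter,
     *                 in the same order (size = pipeline.size() - 1).
     * Returns the node to treat as bad.
     */
    static String guessFaultyNode(List<String> pipeline, List<SelfCheck> upstreamChecks) {
        // Walk from the head of the pipeline: the first node whose own replica is
        // bad is blamed (scenario 2 blames DN2, scenario 3 blames DN1).
        for (int i = 0; i < upstreamChecks.size(); i++) {
            if (upstreamChecks.get(i) == SelfCheck.BAD) {
                return pipeline.get(i);
            }
        }
        // Every upstream replica looks good, so suspect the reporting (last) node
        // first (scenario 1, steps 1.1-1.3).
        return pipeline.get(pipeline.size() - 1);
    }

    public static void main(String[] args) {
        List<String> pipeline = List.of("DN1", "DN2", "DN3");

        // Scenario 1: DN3 reports the error, DN1 and DN2 are clean -> blame DN3.
        System.out.println(guessFaultyNode(pipeline,
                List.of(SelfCheck.GOOD, SelfCheck.GOOD)));   // prints DN3

        // Scenario 2: DN2 finds its own replica bad, DN1 is clean -> blame DN2.
        System.out.println(guessFaultyNode(pipeline,
                List.of(SelfCheck.GOOD, SelfCheck.BAD)));    // prints DN2

        // Scenario 3: DN1 and DN2 both find their own replicas bad -> blame DN1.
        System.out.println(guessFaultyNode(pipeline,
                List.of(SelfCheck.BAD, SelfCheck.BAD)));     // prints DN1
    }
}
{code}

The sketch only covers the first-round decision. Handling steps 1.5-1.6 (the replacement node DN4 reporting the same error, after which the sender DN2 is blamed) would require remembering earlier recovery attempts, which is not modelled here.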