[ https://issues.apache.org/jira/browse/HDFS-10714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15542658#comment-15542658 ]

Kihwal Lee commented on HDFS-10714:
-----------------------------------

+1 on DN remembering what it did during a recovery and being more adaptive.

Unconditionally removing DN3 first might not be a good idea, since it is the 
only node that performed checksum verification. The data on this node up to the 
ACKed bytes is very likely good (though it could still have wrong data on 
disk). In the majority of cases I have analyzed in the past, this would hurt 
more than help. Sure, DN3 might be at fault, but it seems too harsh to remove 
it first. Perhaps instead of statically removing one node, the DN should 
perform further diagnostics. The client could try different node orderings in 
the pipeline before removing any node. We could also add a feature to tell all 
DNs in the pipeline to do checksum verification in the middle of a block write 
(is a per-packet switch possible?). If the errors from these propagate properly 
to the client, it will be able to make a more informed decision and avoid 
blaming the wrong node.
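The difference between tail-only verification and per-packet verification at every DN can be sketched with a small simulation. This is illustrative only, not HDFS code; it assumes a simplified fault model in which the faulty node corrupts data on receipt, so it and every downstream node see bad data:

```python
def run_pipeline(order, faulty, verify_all=False):
    """Simulate one block write through a pipeline of datanode names.

    Hypothetical model, not HDFS code: the faulty node corrupts data
    as it receives it, so it and every node downstream of it see bad
    data. Returns the first node to report ERROR_CHECKSUM, or None if
    no verifier saw corruption.
    """
    corrupt_from = order.index(faulty)
    # Today only the tail of the pipeline verifies checksums; the
    # proposed per-packet switch would make every node verify.
    verifiers = set(order) if verify_all else {order[-1]}
    for i, node in enumerate(order):
        if i >= corrupt_from and node in verifiers:
            return node
    return None
```

With only the tail verifying, the reporter is always the last node, so the client cannot tell where the error was introduced. With all nodes verifying, the first reporter is the faulty node itself: `run_pipeline(["DN1", "DN3", "DN2"], "DN3", verify_all=True)` returns `"DN3"` rather than `"DN2"`, which is the kind of localized signal the client could use before marking any node bad.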

Of course, this won't be perfect either. We also see checksum problems during 
DFS writes stemming from faulty clients; clients hitting OOM are the most 
common case. These are irrecoverable. While we are on the subject of write 
pipelines, the transferBlock op during replication is even worse, since the 
ACK is practically turned off. A node with a faulty NIC can do some damage 
there. But that's outside the scope of this jira.

> Issue in handling checksum errors in write pipeline when fault DN is 
> LAST_IN_PIPELINE
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-10714
>                 URL: https://issues.apache.org/jira/browse/HDFS-10714
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>         Attachments: HDFS-10714-01-draft.patch
>
>
> We came across an issue where a write failed even though 7 DNs were 
> available, due to a network fault at one datanode which is LAST_IN_PIPELINE. 
> It is similar to HDFS-6937.
> Scenario: DN3 has a network fault, and min replication = 2.
> Write pipeline:
> DN1->DN2->DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN2 is marked as bad
> DN1->DN4->DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN4 is marked as bad
> ...
> And so on (every time, DN3 is LAST_IN_PIPELINE), until there are no more 
> datanodes left to construct the pipeline.
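The cascade in the quoted description can be reproduced with a toy simulation (illustrative Python, not HDFS code; `write_with_naive_recovery` and its blame policy are a simplified model of the reported behavior, not the actual DataStreamer logic):

```python
def write_with_naive_recovery(pool):
    """Model of the reported failure mode: DN3 is faulty but always
    LAST_IN_PIPELINE, only the last node verifies checksums, and on an
    ERROR_CHECKSUM ack the client blames the reporter's upstream
    neighbor and replaces it. (Simplified model, not HDFS code.)

    Returns the list of (pipeline, blamed_node) attempts.
    """
    faulty = "DN3"
    pipeline = ["DN1", "DN2", faulty]
    spares = [n for n in pool if n not in pipeline]
    attempts = []
    while True:
        # DN3, last in the pipeline, sees corrupt data and acks
        # ERROR_CHECKSUM; the client blames its upstream neighbor.
        blamed = pipeline[-2]
        attempts.append((list(pipeline), blamed))
        if not spares:
            return attempts           # no datanodes left: the write fails
        pipeline[-2] = spares.pop(0)  # replace the innocent node, keep DN3
```

With a 7-node pool, every attempt keeps DN3 at the tail and blames an innocent node, until the client runs out of datanodes and the write fails despite 6 healthy nodes.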



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
