[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787030#action_12787030 ]

Hairong Kuang commented on HDFS-101:
------------------------------------

Assume that there is a pipeline consisting of DN0, ..., DNi, ..., where DN0 is 
the closest to the client. Here is the plan for handling errors detected by DNi:
1. If the error occurs while communicating with DNi+1, send an ack indicating 
that DNi+1 failed, then shut down both the block receiver and the ack responder.
2. If the error is caused by DNi itself, simply shut down both the block receiver 
and the ack responder. Shutting down the block receiver closes the connections to 
DNi-1, so DNi-1 will immediately detect that DNi has failed.
3. If the error is caused by DNi-1, handle it the same way as in case 2.
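
To make the three cases concrete, here is a minimal, purely illustrative sketch 
of the decision logic in Java; the class, interface, and method names are 
hypothetical and are not the actual BlockReceiver/PacketResponder code:

// Hypothetical illustration of the plan above, not actual HDFS code.
public class PipelineErrorSketch {

  enum ErrorSource { DOWNSTREAM, LOCAL, UPSTREAM }   // DNi+1, DNi, DNi-1

  interface AckSender {
    // Sends an ack upstream marking the datanode at pipeline position
    // failedIndex as the failed one (hypothetical signature).
    void sendFailureAck(int failedIndex);
  }

  interface Stoppable {
    void shutdown();
  }

  /** Handles an error detected by DNi (this datanode, at position myIndex). */
  static void handleError(ErrorSource source, int myIndex, AckSender ackSender,
                          Stoppable blockReceiver, Stoppable ackResponder) {
    if (source == ErrorSource.DOWNSTREAM) {
      // Case 1: DNi+1 failed; tell the upstream nodes and the client which
      // datanode to drop from the pipeline.
      ackSender.sendFailureAck(myIndex + 1);
    }
    // In all three cases, shut down both threads. Closing the block receiver
    // also closes the connections to DNi-1, so the upstream node (or the
    // client) notices the failure of DNi right away (cases 2 and 3).
    blockReceiver.shutdown();
    ackResponder.shutdown();
  }
}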

Errors may be detected by either the block receiver or the ack responder. 
Whichever one detects the error needs to notify the other, so that the other 
can also stop and shut itself down.
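
One simple way to realize that mutual notification, shown only as a sketch under 
the assumption that both threads poll a shared flag between packets and acks 
(the class below is hypothetical, not the real HDFS code):

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: the block receiver and the ack responder share an error
// flag, so whichever thread detects a failure can tell the other one to stop
// instead of interrupt()ing it at an unknown point in its execution.
public class SharedErrorSignal {
  private final AtomicBoolean errorSeen = new AtomicBoolean(false);

  /** Called by whichever thread detects the error first. */
  public void signalError() {
    errorSeen.set(true);
  }

  /** Checked by both threads between packets / acks. */
  public boolean shouldStop() {
    return errorSeen.get();
  }

  // Skeleton of how the block receiver side might use the flag; the ack
  // responder side would look the same.
  void receiverLoop() {
    while (!shouldStop()) {
      try {
        // ... receive the next packet and mirror it to DNi+1 ...
      } catch (Exception e) {
        signalError();   // let the ack responder know before stopping
        break;
      }
    }
    // ... close streams and sockets, then shut down ...
  }
}

Signaling through a flag checked at well-defined points (or by closing the 
shared streams) avoids the coarse interrupt() problem described in HADOOP-3339, 
because each thread stops only between packets or acks rather than in an 
arbitrary state.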

> DFS write pipeline : DFSClient sometimes does not detect second datanode 
> failure 
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-101
>                 URL: https://issues.apache.org/jira/browse/HDFS-101
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Raghu Angadi
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.21.0
>
>
> When the first datanode's write to the second datanode fails or times out, 
> DFSClient ends up marking the first datanode as the bad one and removing it 
> from the pipeline. A similar problem existed on the DataNode as well and was 
> fixed in HADOOP-3339. From HADOOP-3339: 
> "The main issue is that BlockReceiver thread (and DataStreamer in the case of 
> DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty 
> coarse control. We don't know what state the responder is in and interrupting 
> has different effects depending on responder state. To fix this properly we 
> need to redesign how we handle these interactions."
> When the first datanode closes its socket from DFSClient, DFSClient should 
> properly read all the data left in the socket. Also, DataNode's closing of 
> the socket should not result in a TCP reset; otherwise I think DFSClient will 
> not be able to read from the socket.
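
As a general illustration of that last point (generic Java socket handling, not 
DataNode code): a TCP reset is typically produced when a socket is closed while 
unread data still sits in its receive buffer, so a "graceful" close half-closes 
the write side and drains the read side before calling close():

import java.io.InputStream;
import java.net.Socket;

// Generic sketch, not DataNode code: close a socket without provoking a TCP
// reset. A reset would prevent the peer (here DFSClient) from reading the
// bytes that were already sent to it.
public final class GracefulClose {
  public static void close(Socket socket) {
    try {
      socket.shutdownOutput();          // send FIN but keep the read side open
      InputStream in = socket.getInputStream();
      byte[] buf = new byte[4096];
      while (in.read(buf) != -1) {
        // drain anything the peer is still sending, so close() finds no
        // unread data in the receive buffer
      }
    } catch (Exception e) {
      // best effort; fall through to close()
    } finally {
      try {
        socket.close();
      } catch (Exception e) {
        // ignore
      }
    }
  }
}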

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
