[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-101:
-----------------------------

    Attachment: hdfs-101-branch-0.20-append-cdh3.txt

Hey Nicolas,

I just compared our two patches side by side. The one I've been testing with 
is attached; in cluster failure testing it made a noticeable improvement in 
recovery's ability to detect the correct down node. Here are a few differences 
I noticed (though some may just be because the diffs are against different 
trees):

- It looks like your patch doesn't maintain wire compatibility when 
mirrorError is true, since it constructs a "replies" list with only 2 elements 
rather than sizing it by the number of downstream nodes (first sketch below).
- When receiving packets in BlockReceiver, I explicitly forward HEART_BEAT 
packets, whereas it looks like you aren't checking for them (second sketch 
below). Have you verified, by leaving a connection open with no data flowing, 
that heartbeats are handled properly in BlockReceiver?

> DFS write pipeline : DFSClient sometimes does not detect second datanode 
> failure 
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-101
>                 URL: https://issues.apache.org/jira/browse/HDFS-101
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20-append, 0.20.1
>            Reporter: Raghu Angadi
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.20.2, 0.21.0
>
>         Attachments: detectDownDN-0.20.patch, detectDownDN1-0.20.patch, 
> detectDownDN2.patch, detectDownDN3-0.20-yahoo.patch, 
> detectDownDN3-0.20.patch, detectDownDN3.patch, 
> hdfs-101-branch-0.20-append-cdh3.txt, hdfs-101.tar.gz, 
> HDFS-101_20-append.patch, pipelineHeartbeat.patch, 
> pipelineHeartbeat_yahoo.patch
>
>
> When the first datanode's write to the second datanode fails or times out, 
> DFSClient ends up marking the first datanode as the bad one and removes it 
> from the pipeline. A similar problem exists on the DataNode as well and is 
> fixed in HADOOP-3339. From HADOOP-3339: 
> "The main issue is that BlockReceiver thread (and DataStreamer in the case of 
> DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty 
> coarse control. We don't know what state the responder is in and interrupting 
> has different effects depending on responder state. To fix this properly we 
> need to redesign how we handle these interactions."
> When the first datanode closes its socket to DFSClient, DFSClient should 
> properly read all the data left in the socket. Also, the DataNode's closing 
> of the socket should not result in a TCP reset; otherwise I think DFSClient 
> will not be able to read from the socket.
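
On the TCP reset point in the description: a close() on a socket that still 
has unread bytes in its receive buffer makes the kernel send a RST, which can 
discard data the peer has not read yet. A minimal, illustrative sketch of the 
graceful-close idea (not taken from any of the attached patches; a real 
implementation would bound the drain with a read timeout):

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.Socket;

    class GracefulClose {
      // Drain any unread bytes before close(), so the kernel sends a normal
      // FIN instead of a RST and the peer (here, DFSClient) can finish
      // reading whatever was already in flight.
      static void closeWithoutReset(Socket s) throws IOException {
        s.shutdownOutput();                 // send FIN: we are done writing
        InputStream in = s.getInputStream();
        byte[] buf = new byte[4096];
        while (in.read(buf) != -1) {
          // discard; we only need the receive buffer empty at close()
        }
        s.close();
      }
    }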

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
