[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792261#action_12792261
 ] 

Todd Lipcon commented on HDFS-101:
----------------------------------

I just tried changing that if statement to == 0 instead of > 0, and it seems to 
have fixed the bug for me. I reran the above test and it successfully ejected 
the full node:

09/12/17 19:59:21 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor 
exception  for block blk_-1132852588861426806_1405java.io.IOException: Bad 
response 1 for block blk_-1132852588861426806_1405 from datanode 
10.251.43.82:50010
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2427)

09/12/17 19:59:21 WARN hdfs.DFSClient: Error Recovery for block 
blk_-1132852588861426806_1405 bad datanode[1] 10.251.43.82:50010
09/12/17 19:59:21 WARN hdfs.DFSClient: Error Recovery for block 
blk_-1132852588861426806_1405 in pipeline 10.250.7.148:50010, 
10.251.43.82:50010, 10.251.66.212:50010: bad datanode 10.251.43.82:50010

Is it possible that clientName.length() would ever be equal to 0 in 
handleMirrorOutError? I'm new to this area of the code, so I may be missing 
something, but as I understand it, clientName.length() is 0 only for 
inter-datanode replication requests. For those writes, I didn't think any 
pipelining was done.
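In other words, the check I changed behaves (as I read it) like the predicate below. This is a simplified, hypothetical sketch to show the distinction, not the actual BlockReceiver/handleMirrorOutError source; the class and method names here are made up:

```java
public class MirrorErrorCheck {
    // Hypothetical stand-in for the condition discussed above.
    // As I understand it, clientName is the empty string for
    // inter-datanode replication requests, and non-empty for
    // client-initiated writes -- the only writes that use a pipeline.
    static boolean isReplicationRequest(String clientName) {
        return clientName.length() == 0;
    }

    public static void main(String[] args) {
        // Replication request: no downstream pipeline to blame.
        System.out.println(isReplicationRequest(""));
        // Client write: mirror errors should point at the downstream node.
        System.out.println(isReplicationRequest("DFSClient_attempt_123"));
    }
}
```

If replication requests really never pipeline, then flipping the test from `> 0` to `== 0` changes which path handles the mirror error for client writes, which would explain why the bad second datanode now gets ejected.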

> DFS write pipeline : DFSClient sometimes does not detect second datanode 
> failure 
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-101
>                 URL: https://issues.apache.org/jira/browse/HDFS-101
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>            Reporter: Raghu Angadi
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: detectDownDN-0.20.patch, detectDownDN.patch, 
> detectDownDN1.patch, hdfs-101.tar.gz
>
>
> When the first datanode's write to second datanode fails or times out 
> DFSClient ends up marking first datanode as the bad one and removes it from 
> the pipeline. Similar problem exists on DataNode as well and it is fixed in 
> HADOOP-3339. From HADOOP-3339 : 
> "The main issue is that BlockReceiver thread (and DataStreamer in the case of 
> DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty 
> coarse control. We don't know what state the responder is in and interrupting 
> has different effects depending on responder state. To fix this properly we 
> need to redesign how we handle these interactions."
> When the first datanode closes its socket from DFSClient, DFSClient should 
> properly read all the data left in the socket. Also, DataNode's closing of 
> the socket should not result in a TCP reset, otherwise I think DFSClient will 
> not be able to read from the socket.
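The "read all the data left in the socket" step described above amounts to draining the input stream to EOF before giving up on the connection, so any acks the datanode sent before closing are not lost. A minimal illustrative sketch, assuming a plain InputStream (not the actual DFSClient code):

```java
import java.io.IOException;
import java.io.InputStream;

public class SocketDrain {
    // Read and discard whatever bytes remain on the stream until the
    // peer's close is observed as EOF (read() returning -1).
    // Returns the number of bytes drained. Note: if the peer's close
    // produced a TCP reset, read() throws IOException instead, which
    // is exactly the failure mode the issue description warns about.
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new java.io.ByteArrayInputStream(new byte[]{1, 2, 3});
        System.out.println(drain(in));
    }
}
```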

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
