[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792274#action_12792274 ]

Todd Lipcon commented on HDFS-101:
----------------------------------

As a second test of the above modification, I started uploading a 1G file, then 
forcibly killed the DN on 10.250.7.148.
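(For context: the upload itself was apparently a plain "hadoop fs -put", 
judging from the "put:" line in the log below, which amounts to roughly the 
following. This is a hypothetical sketch, not the code I ran; the class name 
and buffer size are made up, the path matches the run, and the kill is done 
out-of-band with kill -9 on the DN process.)

// Hypothetical repro sketch (not the actual test driver).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOneGig {
  public static void main(String[] args) throws Exception {
    // Assumes the 0.20-era cluster configuration is on the classpath.
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/user/root/1261098884"));
    byte[] buf = new byte[64 * 1024];
    try {
      // 1G = 16384 writes of 64K; kill -9 the target DN mid-loop.
      for (int i = 0; i < 16 * 1024; i++) {
        out.write(buf);
      }
    } finally {
      out.close(); // with a DN down mid-pipeline, this is where it blows up
    }
  }
}

The client log from that run: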

09/12/17 20:14:53 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor 
exception  for block blk_-8026763677133524198_1407java.io.IOException: Bad 
response 1 for block blk_-8026763677133524198_1407 from datanode 
10.251.66.212:50010
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2427)

09/12/17 20:14:53 WARN hdfs.DFSClient: Error Recovery for block 
blk_-8026763677133524198_1407 bad datanode[2] 10.251.66.212:50010
09/12/17 20:14:53 WARN hdfs.DFSClient: Error Recovery for block 
blk_-8026763677133524198_1407 in pipeline 10.250.7.148:50010, 
10.251.43.82:50010, 10.251.66.212:50010: bad datanode 10.251.66.212:50010
09/12/17 20:14:54 INFO hdfs.DFSClient: Exception in createBlockOutputStream 
java.io.IOException: Bad connect ack with firstBadLink 10.251.66.212:50010
09/12/17 20:14:54 INFO hdfs.DFSClient: Abandoning block 
blk_-3750676278765626865_1408
09/12/17 20:15:00 INFO hdfs.DFSClient: Exception in createBlockOutputStream 
java.io.IOException: Bad connect ack with firstBadLink 10.251.66.212:50010
09/12/17 20:15:00 INFO hdfs.DFSClient: Abandoning block 
blk_7561780221358446528_1408
09/12/17 20:15:06 INFO hdfs.DFSClient: Exception in createBlockOutputStream 
java.io.IOException: Bad connect ack with firstBadLink 10.251.66.212:50010
09/12/17 20:15:06 INFO hdfs.DFSClient: Abandoning block 
blk_-8059177057921476468_1408
09/12/17 20:15:12 INFO hdfs.DFSClient: Exception in createBlockOutputStream 
java.net.ConnectException: Connection refused
09/12/17 20:15:12 INFO hdfs.DFSClient: Abandoning block 
blk_-8264633252613228869_1408
09/12/17 20:15:18 WARN hdfs.DFSClient: DataStreamer Exception: 
java.io.IOException: Unable to create new block.
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2818)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)

09/12/17 20:15:18 WARN hdfs.DFSClient: Error Recovery for block 
blk_-8264633252613228869_1408 bad datanode[0] nodes == null
09/12/17 20:15:18 WARN hdfs.DFSClient: Could not get block locations. Source 
file "/user/root/1261098884" - Aborting...
put: Connection refused
09/12/17 20:15:18 ERROR hdfs.DFSClient: Exception closing file 
/user/root/1261098884 : java.net.ConnectException: Connection refused
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2843)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2799)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)

As you can see above, it correctly detected the down DN. But the second block 
of the file failed to write (the file left on HDFS at the end was exactly 
128MB, i.e. exactly one full block). fsck -openforwrite shows that the file is 
still open:

OPENFORWRITE: ........./user/root/1261098884 134217728 bytes, 1 block(s), 
OPENFORWRITE: 
/user/root/1261098884:  Under replicated blk_-8026763677133524198_1408. Target 
Replicas is 3 but found 2 replica(s).
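(For reference, the check above is just fsck with the open-files option, 
something like: hadoop fsck /user/root/1261098884 -openforwrite.)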


> DFS write pipeline : DFSClient sometimes does not detect second datanode 
> failure 
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-101
>                 URL: https://issues.apache.org/jira/browse/HDFS-101
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>            Reporter: Raghu Angadi
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: detectDownDN-0.20.patch, detectDownDN.patch, 
> detectDownDN1-0.20.patch, detectDownDN1.patch, hdfs-101.tar.gz
>
>
> When the first datanode's write to the second datanode fails or times out, 
> DFSClient ends up marking the first datanode as the bad one and removing it 
> from the pipeline. A similar problem exists on the DataNode side as well, and 
> it is fixed in HADOOP-3339. From HADOOP-3339: 
> "The main issue is that BlockReceiver thread (and DataStreamer in the case of 
> DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty 
> coarse control. We don't know what state the responder is in and interrupting 
> has different effects depending on responder state. To fix this properly we 
> need to redesign how we handle these interactions."
> When the first datanode closes its socket to DFSClient, DFSClient should 
> properly read all the data left in the socket. Also, the DataNode's closing 
> of the socket should not result in a TCP reset; otherwise I think DFSClient 
> will not be able to read from the socket.
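
To make the misattribution concrete, here is a toy model of the client-side 
ack handling, heavily simplified and not the actual DFSClient code; the class 
and field names are illustrative only:

import java.io.DataInputStream;
import java.io.IOException;

class ResponseProcessorSketch {
  static final short OP_STATUS_SUCCESS = 0;
  volatile int errorIndex = -1;   // index of the pipeline node to evict
  final int numNodes = 3;         // nodes[0] -> nodes[1] -> nodes[2]

  void processOneAck(DataInputStream ackIn) {
    try {
      long seqno = ackIn.readLong();        // packet being acknowledged
      for (int i = 0; i < numNodes; i++) {
        short status = ackIn.readShort();   // per-node status, upstream first
        if (status != OP_STATUS_SUCCESS) {
          errorIndex = i;                   // node i reported the failure
          return;
        }
      }
    } catch (IOException e) {
      // The problem described above: if nodes[0] tears down the ack socket
      // because ITS mirror write to nodes[1] failed, the client never sees a
      // per-node status. It only gets a read error here and falls back to
      // blaming nodes[0], evicting the wrong datanode from the pipeline.
      errorIndex = 0;
    }
  }
}

Per the description above, the fix has two halves: the client must drain 
whatever ack data is left in the socket before deciding which node to evict, 
and the datanode's close must not cause a TCP reset, since a reset would make 
that drain impossible.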

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
