[jira] [Commented] (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028607#comment-13028607 ]

stack commented on HDFS-101:

Do we know what issue fixed this failing test? We should backport it to the branch-0.20-append branch. Thanks.

> DFS write pipeline : DFSClient sometimes does not detect second datanode failure
> --------------------------------------------------------------------------------
>
> Key: HDFS-101
> URL: https://issues.apache.org/jira/browse/HDFS-101
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node
> Affects Versions: 0.20.1, 0.20-append
> Reporter: Raghu Angadi
> Assignee: Hairong Kuang
> Priority: Blocker
> Fix For: 0.20.2, 0.20-append, 0.21.0
> Attachments: HDFS-101_20-append.patch, detectDownDN-0.20.patch, detectDownDN1-0.20.patch, detectDownDN2.patch, detectDownDN3-0.20-yahoo.patch, detectDownDN3-0.20.patch, detectDownDN3.patch, hdfs-101-branch-0.20-append-cdh3.txt, hdfs-101.tar.gz, pipelineHeartbeat.patch, pipelineHeartbeat_yahoo.patch
>
> When the first datanode's write to the second datanode fails or times out, DFSClient ends up marking the first datanode as the bad one and removes it from the pipeline. A similar problem exists on the DataNode as well, and it is fixed in HADOOP-3339. From HADOOP-3339:
>
> "The main issue is that the BlockReceiver thread (and DataStreamer in the case of DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty coarse control. We don't know what state the responder is in, and interrupting has different effects depending on responder state. To fix this properly we need to redesign how we handle these interactions."
>
> When the first datanode closes its socket from DFSClient, DFSClient should properly read all the data left in the socket. Also, the DataNode's closing of the socket should not result in a TCP reset; otherwise I think DFSClient will not be able to read from the socket.

-- This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
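The description's last point, that the client should drain leftover data and that closing the socket must not trigger a TCP reset, can be sketched in plain Java. This is an illustrative sketch, not Hadoop code: on most TCP stacks, calling close() while unread bytes sit in the receive buffer provokes an RST, and the peer's pending read then fails instead of seeing a clean EOF. Shutting down output and draining the input first avoids that.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

public class GracefulClose {
    // Close a socket so the peer sees an orderly EOF instead of a TCP reset.
    // If unread data remains in our receive buffer when close() runs, many
    // TCP implementations send RST and the peer can no longer read the
    // bytes still in flight (the failure mode described for DFSClient).
    static void closeGracefully(Socket s) throws IOException {
        s.shutdownOutput();                // send FIN; we will write no more
        InputStream in = s.getInputStream();
        byte[] buf = new byte[4096];
        while (in.read(buf) != -1) {
            // discard whatever the peer sent; we only need the buffer empty
        }
        s.close();                         // no unread data left -> clean close
    }
}
```

The drain loop terminates once the peer closes or shuts down its own output side; a production version would bound it with a read timeout.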
[jira] [Commented] (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028588#comment-13028588 ]

Hari commented on HDFS-101:

Yes, that seems to be the problem. The client expects 3 replies in readFields while the datanode is sending only 2. It is fixed in 0.21.
[jira] [Commented] (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027104#comment-13027104 ]

Hairong Kuang commented on HDFS-101:

Simply looking at your client-side log, it seems to me that the datanode sent an ack with two fields but the client expects an ack with 3 fields. Could you please create a jira and post logs there? Thanks!
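Hairong's diagnosis, an ack carrying two status fields read by a client expecting three, maps directly onto how a fixed-count readFields runs off the end of the stream. The sketch below is a simplified, hypothetical model: AckReader and its field layout only loosely mirror DataTransferProtocol.PipelineAck and are assumptions for illustration, not the actual Hadoop wire format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical, simplified pipeline ack: a seqno followed by one
// status short per datanode the READER believes is in the pipeline.
class AckReader {
    long seqno;
    short[] replies;

    // The reader assumes a fixed reply count. If the writer serialized
    // fewer statuses (2 surviving datanodes vs an expected 3), the last
    // readShort() hits end-of-stream and throws java.io.EOFException.
    void readFields(DataInputStream in, int expectedReplies) throws IOException {
        seqno = in.readLong();
        replies = new short[expectedReplies];
        for (int i = 0; i < expectedReplies; i++) {
            replies[i] = in.readShort();
        }
    }

    // Writer side: serializes only the statuses it actually has.
    static byte[] write(long seqno, short[] statuses) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeLong(seqno);
        for (short s : statuses) out.writeShort(s);
        return bos.toByteArray();
    }
}
```

Reading a two-status payload with expectedReplies = 3 surfaces as an EOFException thrown from DataInputStream.readShort, the same exception site the client-side stack trace in this thread points at.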
[jira] [Commented] (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027003#comment-13027003 ]

Hari commented on HDFS-101:

TestFileAppend4 is failing randomly even with this patch in the 20-append branch. We start 3 datanodes in the MiniDfsCluster. When the second datanode is killed, the first datanode correctly sends the ack SUCCESS (for itself), FAILED (for the 2nd datanode) to the client. But the client gets an EOFException while reading this ack from the first datanode and hence incorrectly assumes the first datanode is bad. As a result, the first datanode (which is fine) is removed from the pipeline, and eventually the 2nd datanode will also be removed during recovery. Now only 1 replica (the 3rd datanode) remains for the block, and the assertion fails. While the 1st datanode is fine and is sending the ack correctly to the client, it is not clear why the client gets an EOFException. Is this normal? The relevant part of the log is:

2011-04-27 14:48:16,796 DEBUG hdfs.DFSClient (DFSClient.java:run(2445)) - DataStreamer block blk_-1353652352417823279_1001 wrote packet seqno:-1 size:25 offsetInBlock:0 lastPacketInBlock:false
2011-04-27 14:48:16,796 DEBUG datanode.DataNode (BlockReceiver.java:run(891)) - PacketResponder 2 for block blk_-1353652352417823279_1001 responded an ack: Replies for seqno -1 are SUCCESS FAILED
2011-04-27 14:48:16,796 WARN hdfs.DFSClient (DFSClient.java:run(2596)) - DFSOutputStream ResponseProcessor exception for block blk_-1353652352417823279_1001 java.io.EOFException
    at java.io.DataInputStream.readShort(Unknown Source)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:125)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2548)
2011-04-27 14:48:16,796 DEBUG datanode.DataNode (BlockReceiver.java:run(789)) - PacketResponder 2 seqno = -2 for block blk_-1353652352417823279_1001 waiting for local datanode to finish write.
2011-04-27 14:48:16,796 WARN hdfs.DFSClient (DFSClient.java:processDatanodeError(2632)) - Error Recovery for block blk_-1353652352417823279_1001 bad datanode[0] 127.0.0.1:4956
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879439#action_12879439 ]

Nicolas Spiegelberg commented on HDFS-101:

Todd, your assumption is correct. I needed a couple of small things from the HDFS-793 patch (namely, getNumOfReplies) to make HDFS-101 compatible with HDFS-872.
- You can reply to this email to add a comment to the issue online.
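Nicolas mentions getNumOfReplies from the HDFS-793 patch. One way to picture why a reply count helps: if the writer serializes the count onto the wire first, the reader no longer has to assume the pipeline length, so a pipeline that shrank mid-write cannot cause a spurious end-of-stream. The sketch below is illustrative only; SelfDescribingAck is a hypothetical name, and the field layout is an assumption, not the actual Hadoop class.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative ack whose wire format carries its own reply count, in the
// spirit of HDFS-793's getNumOfReplies: seqno, count, then the statuses.
class SelfDescribingAck {
    long seqno;
    short[] replies;

    SelfDescribingAck() {}

    SelfDescribingAck(long seqno, short[] replies) {
        this.seqno = seqno;
        this.replies = replies;
    }

    void write(DataOutputStream out) throws IOException {
        out.writeLong(seqno);
        out.writeShort(replies.length);   // the count travels on the wire
        for (short r : replies) out.writeShort(r);
    }

    void readFields(DataInputStream in) throws IOException {
        seqno = in.readLong();
        int n = in.readShort();           // no guessing the pipeline length
        replies = new short[n];
        for (int i = 0; i < n; i++) replies[i] = in.readShort();
    }
}
```

The trade-off Todd raises elsewhere in the thread applies here: changing what goes on the wire is exactly the kind of edit that must preserve compatibility between old and new readers.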
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876800#action_12876800 ]

Todd Lipcon commented on HDFS-101:

Hey Nicolas. Can you clarify what you mean by HDFS-793 no longer being necessary? You mean that it's not necessary since we already have HDFS-872 applied to the branch? I agree with that. Also, I agree that we maintain wire compatibility with these patches.
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847869#action_12847869 ]

Hairong Kuang commented on HDFS-101:

> is this latest patch applicable for branch-20 as well

I do not think that it applies to 0.20. I opened a different jira for committing HDFS-101 to 0.20. I will work on it when I have time.
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847848#action_12847848 ]

Todd Lipcon commented on HDFS-101:

Hey Hairong - is this latest patch applicable for branch-20 as well, or is it unique to the way that HDFS-101 made it into ydist? (I haven't had the time to look at it in detail.)
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802582#action_12802582 ]

Hudson commented on HDFS-101:

Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #196 (See [http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/196/])
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802192#action_12802192 ]

Hudson commented on HDFS-101:

Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #99 (See [http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/99/])

Move the change logs of HDFS-793 and from 0.20 section to 0.21 section
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801169#action_12801169 ]

Hudson commented on HDFS-101:

Integrated in Hadoop-Hdfs-trunk #202 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/202/])

Move the change logs of HDFS-793 and from 0.20 section to 0.21 section
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800855#action_12800855 ]

Hudson commented on HDFS-101:

Integrated in Hadoop-Hdfs-trunk-Commit #172 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/172/])

Move the change logs of HDFS-793 and from 0.20 section to 0.21 section
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798992#action_12798992 ]

Hudson commented on HDFS-101:

Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #94 (See [http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/94/])
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794610#action_12794610 ]

Hudson commented on HDFS-101:

Integrated in Hadoop-Hdfs-trunk #182 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/182/])
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794348#action_12794348 ] Hudson commented on HDFS-101: - Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #159 (See [http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/159/])
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793393#action_12793393 ] Hudson commented on HDFS-101: - Integrated in Hadoop-Hdfs-trunk-Commit #152 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/152/]). DFS write pipeline: DFSClient sometimes does not detect second datanode failure. Contributed by Hairong Kuang.
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793372#action_12793372 ] Tsz Wo (Nicholas), SZE commented on HDFS-101: - Hi Todd, thank you for testing it. Could you post the log of DN .82 which shows the DiskOutOfSpaceException stack trace in [your test cases|https://issues.apache.org/jira/browse/HDFS-101?focusedCommentId=12792245&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12792245]? We may add new unit tests for your test cases.
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792677#action_12792677 ] dhruba borthakur commented on HDFS-101: --- +1 Code looks good. The unit test will be difficult to write. I will look at the unit test when you post it.
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792572#action_12792572 ] Hairong Kuang commented on HDFS-101: > Do you know of a good way to manually trigger this? Increasing a file's replication factor from 1 to 3 will trigger this. But I do not think my patch changes any replication behavior, because IOException is not thrown in the case of block replication.
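Hairong's manual trigger can be reproduced from the shell on an 0.20-era cluster. This is a sketch only: the local and HDFS paths are arbitrary examples, not from this issue, and it requires a running multi-datanode cluster. Raising the replication factor makes the namenode schedule inter-datanode block transfers, which exercises the empty-clientName code path discussed in this thread.

```
# Write a file with a single replica, then ask for 3 replicas.
hadoop fs -D dfs.replication=1 -put ./largefile /tmp/replication-test

# -w waits until the replication target is reached; the DN-to-DN
# transfers this triggers carry an empty clientName.
hadoop fs -setrep -w 3 /tmp/replication-test
```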
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792473#action_12792473 ] dhruba borthakur commented on HDFS-101: --- you are right. transferBlocks could take multiple targets. That means clientName.len could be zero while at the same time mirror can be non-null.
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792375#action_12792375 ] Todd Lipcon commented on HDFS-101: -- Ah, I never realized that transferBlocks could take multiple targets. Do you know of a good way to manually trigger this? Would be good to verify that it's still working well in branch-20 after this patch, if it's not too hard. (or, is it covered by unit tests?)
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792373#action_12792373 ] Hairong Kuang commented on HDFS-101: Todd and Dhruba, thanks a lot for your testing and review. > replication request always have a pipeline of size 1. I do not think this is true. A replication pipeline could have a size greater than 1. So it is possible that client length == 0 but mirror != null.
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792358#action_12792358 ] dhruba borthakur commented on HDFS-101: --- I am looking at the patch too. Looks good at first sight. > clientName.len == 0 means that this is a block copy for replication. It has > nothing to do if this is the last DN in pipeline or not. I agree. To elaborate, replication requests always have a pipeline of size 1. That means there isn't any mirror in this case.
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792287#action_12792287 ] Hadoop QA commented on HDFS-101:
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428366/detectDownDN1.patch against trunk revision 891593.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/150/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/150/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/150/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/150/console
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792285#action_12792285 ] Todd Lipcon commented on HDFS-101: -- Just applied https://issues.apache.org/jira/secure/attachment/12428383/detectDownDN1-0.20.patch and tested on the cluster. I think the other error I mentioned above is just HDFS-630, since I'm testing on 0.20 on a 3-node cluster, so +1 on this patch. bq. clientName.len == 0 means that this is a block copy for replication. It has nothing to do if this is the last DN in pipeline or not. Right, but my question is whether clientName.len can ever be 0 when there's a mirror. My belief is no. Perhaps it's worth an assert there (since we're now cool with assertions in HDFS)
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792275#action_12792275 ] Todd Lipcon commented on HDFS-101: -- err, sorry, correction to above. I forcibly killed the DN on *10.251.66.212*. So the detection of down node was correct, it was just failure recovery that was problematic. This might be related to HDFS-630.
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792274#action_12792274 ] Todd Lipcon commented on HDFS-101: -- As a second test of the above modification, I started uploading a 1G file, then forcibly killed the DN on 10.250.7.148:

09/12/17 20:14:53 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-8026763677133524198_1407 java.io.IOException: Bad response 1 for block blk_-8026763677133524198_1407 from datanode 10.251.66.212:50010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2427)
09/12/17 20:14:53 WARN hdfs.DFSClient: Error Recovery for block blk_-8026763677133524198_1407 bad datanode[2] 10.251.66.212:50010
09/12/17 20:14:53 WARN hdfs.DFSClient: Error Recovery for block blk_-8026763677133524198_1407 in pipeline 10.250.7.148:50010, 10.251.43.82:50010, 10.251.66.212:50010: bad datanode 10.251.66.212:50010
09/12/17 20:14:54 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.251.66.212:50010
09/12/17 20:14:54 INFO hdfs.DFSClient: Abandoning block blk_-3750676278765626865_1408
09/12/17 20:15:00 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.251.66.212:50010
09/12/17 20:15:00 INFO hdfs.DFSClient: Abandoning block blk_7561780221358446528_1408
09/12/17 20:15:06 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.251.66.212:50010
09/12/17 20:15:06 INFO hdfs.DFSClient: Abandoning block blk_-8059177057921476468_1408
09/12/17 20:15:12 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
09/12/17 20:15:12 INFO hdfs.DFSClient: Abandoning block blk_-8264633252613228869_1408
09/12/17 20:15:18 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2818)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
09/12/17 20:15:18 WARN hdfs.DFSClient: Error Recovery for block blk_-8264633252613228869_1408 bad datanode[0] nodes == null
09/12/17 20:15:18 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/root/1261098884" - Aborting...
put: Connection refused
09/12/17 20:15:18 ERROR hdfs.DFSClient: Exception closing file /user/root/1261098884 : java.net.ConnectException: Connection refused
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2843)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2799)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)

As you can see above, it correctly detected the down DN. But the second block of the file failed to write (the file left on HDFS at the end was exactly 128M). fsck -openforwrite shows that the file is still open:

OPENFORWRITE: ./user/root/1261098884 134217728 bytes, 1 block(s), OPENFORWRITE: /user/root/1261098884: Under replicated blk_-8026763677133524198_1408. Target Replicas is 3 but found 2 replica(s).
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792261#action_12792261 ] Todd Lipcon commented on HDFS-101: -- I just tried changing that if statement to == 0 instead of > 0, and it seems to have fixed the bug for me. I reran the above test and it successfully ejected the full node:

09/12/17 19:59:21 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-1132852588861426806_1405 java.io.IOException: Bad response 1 for block blk_-1132852588861426806_1405 from datanode 10.251.43.82:50010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2427)
09/12/17 19:59:21 WARN hdfs.DFSClient: Error Recovery for block blk_-1132852588861426806_1405 bad datanode[1] 10.251.43.82:50010
09/12/17 19:59:21 WARN hdfs.DFSClient: Error Recovery for block blk_-1132852588861426806_1405 in pipeline 10.250.7.148:50010, 10.251.43.82:50010, 10.251.66.212:50010: bad datanode 10.251.43.82:50010

Is it possible that clientName.length() would ever be equal to 0 in handleMirrorOutError? I'm new to this area of the code, so I may be missing something, but as I understand it, clientName.length() is 0 only for inter-datanode replication requests. For those writes, I didn't think any pipelining was done.
From HADOOP-3339 : > "The main issue is that BlockReceiver thread (and DataStreamer in the case of > DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty > coarse control. We don't know what state the responder is in and interrupting > has different effects depending on responder state. To fix this properly we > need to redesign how we handle these interactions." > When the first datanode closes its socket from DFSClient, DFSClient should > properly read all the data left in the socket.. Also, DataNode's closing of > the socket should not result in a TCP reset, otherwise I think DFSClient will > not be able to read from the socket. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
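The recovery log above shows the client's ResponseProcessor mapping a bad per-datanode ack status back to a pipeline slot ("bad datanode[1]"). A minimal, hypothetical sketch of that bookkeeping follows; the class and method names are illustrative, not the actual DFSClient code, and the status convention (0 = success, nonzero = error) is an assumption:

```java
// Hypothetical sketch (not the actual DFSClient): given the per-datanode
// ack statuses for one packet, find the pipeline index to eject.
public class AckDemo {
    // Returns the index of the first datanode that reported an error,
    // or -1 if every datanode in the pipeline acked success (0).
    public static int firstBadIndex(int[] replies) {
        for (int i = 0; i < replies.length; i++) {
            if (replies[i] != 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Three-node pipeline; the middle node (index 1) acked an error,
        // mirroring the "bad datanode[1]" recovery in the log above.
        int[] replies = {0, 1, 0};
        System.out.println("bad datanode index = " + firstBadIndex(replies));
    }
}
```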
[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792256#action_12792256 ]

Todd Lipcon commented on HDFS-101:

Come to think of it, it can never be the last node in the pipeline inside handleMirrorOutError, since there's no mirror out to have errors (duh!). But then I'm not sure why that if statement exists at all. An error writing to the mirror indicates a problem with the mirror more than a problem with the node itself.
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792248#action_12792248 ]

Todd Lipcon commented on HDFS-101:

Looking at the patch, I'm confused by the logic in handleMirrorOutError. It seems to me that the if statement there, which checks for a non-empty clientName, should actually be checking whether this is the last node in the pipeline. If it's the first node in the pipeline, shouldn't it propagate the error backward just as if it had received the error while receiving an ack?
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792114#action_12792114 ]

Todd Lipcon commented on HDFS-101:

Any chance that you have a patch available for branch-20 as well? The cluster where I can reliably reproduce this is running 0.20.1, so I would like to test there as well as looking at it on trunk.
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791804#action_12791804 ]

Hairong Kuang commented on HDFS-101:

The answer is yes. Thanks for your help testing it, Todd.
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791778#action_12791778 ]

Todd Lipcon commented on HDFS-101:

Hi Hairong, do you anticipate that this will also solve HDFS-795? If so, I'll try the patch on my test cluster tomorrow.
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788202#action_12788202 ]

Hairong Kuang commented on HDFS-101:

When I thought more about it, it may make sense to let the client decide how to handle the case where DNi has a communication problem with DNi+1, because the client is the one who decides the pipeline recovery policy. DNi itself has no problem receiving packets and storing them to disk, except that it cannot talk to DNi+1. I think in this case it is OK to let DNi continue to run.
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787880#action_12787880 ]

Kan Zhang commented on HDFS-101:

> continue to run until DNi-1 or the client closes the connection.

You may not want to make the server (DN) depend on the behavior of the client (DFSClient).
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787177#action_12787177 ]

Hairong Kuang commented on HDFS-101:

To be more specific: if the block receiver gets an error sending a packet to DNi+1, it still queues the packet to the ack queue, but with a flag "mirrorError" set to true, indicating that the packet had an error mirroring to DNi+1. The block receiver continues to write the packet to disk and then handles the next packet. A packet responder does not exit when it detects that DNi+1 has an error.
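The mirrorError scheme described above can be sketched with a small simulation. This is a hypothetical illustration, not the actual BlockReceiver/PacketResponder code; the class names, the Packet fields, and the ack strings are all invented for the example:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the "mirrorError" idea: on a downstream (DNi+1)
// write failure, the receiver keeps running, still persists the packet
// locally, and tags the queued packet so the responder reports the
// downstream failure in the ack instead of tearing the pipeline down.
public class MirrorErrorDemo {
    static class Packet {
        final long seqno;
        boolean mirrorError; // true if forwarding this packet to DNi+1 failed
        Packet(long seqno) { this.seqno = seqno; }
    }

    final Queue<Packet> ackQueue = new ArrayDeque<>();

    // Receiver side: try to mirror; never abort on a mirror failure.
    void receivePacket(long seqno, boolean mirrorWriteFails) {
        Packet p = new Packet(seqno);
        if (mirrorWriteFails) {
            p.mirrorError = true;  // remember the downstream error...
        }
        // ...but still enqueue the packet for acking (and, in the real
        // datanode, keep writing it to local disk).
        ackQueue.add(p);
    }

    // Responder side: ack success for this node, error for downstream
    // when the packet carried a mirror error. The responder does not exit.
    String ackFor(Packet p) {
        return p.mirrorError ? "SELF_OK,DOWNSTREAM_ERROR" : "SELF_OK,DOWNSTREAM_OK";
    }

    public static void main(String[] args) {
        MirrorErrorDemo dn = new MirrorErrorDemo();
        dn.receivePacket(1, false);
        dn.receivePacket(2, true);  // mirror write to DNi+1 failed here
        dn.receivePacket(3, true);  // receiver kept going regardless
        for (Packet p : dn.ackQueue) {
            System.out.println("seq " + p.seqno + " -> " + dn.ackFor(p));
        }
    }
}
```

The point of the sketch is that the error surfaces through the ack path, so the client (which owns recovery policy) learns which node failed, rather than the upstream node guessing from a dropped connection.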
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787137#action_12787137 ]

Hairong Kuang commented on HDFS-101:

Would this work? When DNi detects an error while communicating with DNi+1, it sends an ack indicating that DNi+1 failed and continues to run until DNi-1 or the client closes the connection.
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787120#action_12787120 ]

Hairong Kuang commented on HDFS-101:

> When the first datanode closes its socket from DFSClient, DFSClient should properly read all the data left in the socket.

Kan, thanks for pointing this out; it is a very valid point. I think this applies to the datanodes in the pipeline as well.
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787047#action_12787047 ]

Kan Zhang commented on HDFS-101:

> here is the plan for handling errors detected by DNi

This was the approach I took in HDFS-564. However, the problem reported in this JIRA was still seen even after I made the change in h564-24.patch. I suspect the key problem here is the following (as described in the description), and it's orthogonal to how a DN reports its downstream errors:

- When the first datanode closes its socket from DFSClient, DFSClient should properly read all the data left in the socket. Also, DataNode's closing of the socket should not result in a TCP reset, otherwise I think DFSClient will not be able to read from the socket.
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787030#action_12787030 ]

Hairong Kuang commented on HDFS-101:

Assume that there is a pipeline consisting of DN0, ..., DNi, ..., where DN0 is the closest to the client. Here is the plan for handling errors detected by DNi:

1. If the error occurs when communicating with DNi+1, send an ack indicating that DNi+1 failed, and then shut down both the block receiver and the ack responder.
2. If the error is caused by DNi itself, simply shut down both the block receiver and the ack responder. Shutting down the block receiver causes the connection to DNi-1 to be closed, so DNi-1 will detect immediately that DNi has failed.
3. If the error is caused by DNi-1, handle it the same as 2.

Errors may be detected by either the block receiver or the ack responder. No matter which one detects the error, it needs to notify the other so that the other will stop and shut itself down.
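The three-case plan above amounts to a small decision rule keyed on where the error originated. A hypothetical sketch (the enum, strings, and method are illustrative only, not the actual DataNode code):

```java
// Hypothetical sketch of the three-way error-handling rule for a
// datanode DNi in the pipeline DN0 ... DNi-1, DNi, DNi+1.
public class ErrorPlanDemo {
    enum Source { DOWNSTREAM, SELF, UPSTREAM } // DNi+1, DNi, DNi-1

    // Per the plan: only a downstream failure is reported in an ack;
    // self/upstream failures just shut both threads down, which closes
    // the link to DNi-1 and lets it detect DNi's failure immediately.
    public static String actionFor(Source s) {
        switch (s) {
            case DOWNSTREAM:
                return "ack that DNi+1 failed, then shut down receiver and responder";
            case SELF:
            case UPSTREAM:
                return "shut down receiver and responder (closes link to DNi-1)";
            default:
                throw new AssertionError("unreachable");
        }
    }

    public static void main(String[] args) {
        for (Source s : Source.values()) {
            System.out.println(s + " -> " + actionFor(s));
        }
    }
}
```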
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787019#action_12787019 ]

Hairong Kuang commented on HDFS-101:

> Is this the same as HDFS-795?

Yes; the only difference is that HDFS-795 describes the problem in a more general way.
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783952#action_12783952 ]

Todd Lipcon commented on HDFS-101:

Is this the same as HDFS-795?
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783858#action_12783858 ]

Hairong Kuang commented on HDFS-101:

This is not an easy problem to solve. I created HDFS-793 as the first step towards a solution. I will elaborate on my plan here while working on HDFS-793.
[ https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782669#action_12782669 ]

Hairong Kuang commented on HDFS-101:

It seems to me there are two issues with this problem:

1. If a datanode gets an error receiving a block, it should not simply stop itself. Instead, it should send a failure ack back.
2. The datanode should also identify the source of the error, whether it was caused by itself or by another datanode in the pipeline, and then report the correct information back.