[jira] [Commented] (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2011-05-03 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028607#comment-13028607
 ] 

stack commented on HDFS-101:


Do we know which issue fixed this failing test?  We should backport it to the
branch-0.20-append branch.  Thanks.

> DFS write pipeline : DFSClient sometimes does not detect second datanode 
> failure 
> -
>
> Key: HDFS-101
> URL: https://issues.apache.org/jira/browse/HDFS-101
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: data-node
>Affects Versions: 0.20.1, 0.20-append
>Reporter: Raghu Angadi
>Assignee: Hairong Kuang
>Priority: Blocker
> Fix For: 0.20.2, 0.20-append, 0.21.0
>
> Attachments: HDFS-101_20-append.patch, detectDownDN-0.20.patch, 
> detectDownDN1-0.20.patch, detectDownDN2.patch, 
> detectDownDN3-0.20-yahoo.patch, detectDownDN3-0.20.patch, 
> detectDownDN3.patch, hdfs-101-branch-0.20-append-cdh3.txt, hdfs-101.tar.gz, 
> pipelineHeartbeat.patch, pipelineHeartbeat_yahoo.patch
>
>
> When the first datanode's write to second datanode fails or times out 
> DFSClient ends up marking first datanode as the bad one and removes it from 
> the pipeline. Similar problem exists on DataNode as well and it is fixed in 
> HADOOP-3339. From HADOOP-3339 : 
> "The main issue is that BlockReceiver thread (and DataStreamer in the case of 
> DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty 
> coarse control. We don't know what state the responder is in and interrupting 
> has different effects depending on responder state. To fix this properly we 
> need to redesign how we handle these interactions."
> When the first datanode closes its socket from DFSClient, DFSClient should 
> properly read all the data left in the socket.. Also, DataNode's closing of 
> the socket should not result in a TCP reset, otherwise I think DFSClient will 
> not be able to read from the socket.
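The last point in the description — that after an orderly close by the datanode, the client can still drain whatever data was already queued in the socket (whereas a TCP reset discards it) — can be sketched with plain sockets. This is an illustrative demo, not HDFS code; the class and method names are made up, and the two shorts stand in for ack status fields.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class DrainAfterCloseDemo {
    // Reads two status shorts from a peer that has already written them
    // and closed gracefully (FIN). The buffered data remains readable.
    static String readAckAfterPeerClose() {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread dn = new Thread(() -> {
                try (Socket s = server.accept();
                     DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
                    out.writeShort(0); // stand-in for SUCCESS
                    out.writeShort(1); // stand-in for FAILED
                } catch (IOException ignored) {
                }
            });
            dn.start();
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
                dn.join(); // the "datanode" has written both shorts and closed
                DataInputStream in = new DataInputStream(client.getInputStream());
                return in.readShort() + " " + in.readShort();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readAckAfterPeerClose()); // prints "0 1"
    }
}
```

An abortive close (e.g. closing with unread inbound data pending, or with SO_LINGER set to zero) would instead send a reset, and the client's subsequent read would fail rather than return the queued bytes — which is why the description asks that the datanode's close not result in a TCP reset.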

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2011-05-03 Thread Hari (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028588#comment-13028588
 ] 

Hari commented on HDFS-101:
---

Yes, that seems to be the problem. The client expects 3 replies in readFields 
while the datanode is sending only 2. It is fixed in 0.21.
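The mismatch described here can be reproduced in isolation. The sketch below uses a simplified ack encoding assumed for illustration (a long seqno followed by one status short per reply — not the real DataTransferProtocol wire format): a reader that assumes a fixed reply count dies in readShort with an EOFException when the ack actually carries fewer replies, matching the stack trace in the client log.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class AckMismatchDemo {
    // Illustrative ack encoding: seqno, then one status short per reply.
    static byte[] encodeAck(long seqno, short... statuses) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeLong(seqno);
        for (short s : statuses) {
            out.writeShort(s);
        }
        return buf.toByteArray();
    }

    // Pre-fix client behaviour: readFields assumes a fixed number of
    // replies, so reading a 2-reply ack as 3 replies hits end-of-stream.
    static List<Short> decodeFixedCount(byte[] ack, int expectedReplies)
            throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(ack));
        in.readLong(); // seqno
        List<Short> replies = new ArrayList<>();
        for (int i = 0; i < expectedReplies; i++) {
            replies.add(in.readShort()); // throws EOFException past the end
        }
        return replies;
    }

    static String demo() {
        try {
            byte[] twoReplies = encodeAck(-1L, (short) 0, (short) 1);
            List<Short> ok = decodeFixedCount(twoReplies, 2);
            try {
                decodeFixedCount(twoReplies, 3); // expects 3, stream holds 2
                return ok + " then no exception";
            } catch (EOFException e) {
                return ok + " then EOFException";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [0, 1] then EOFException
    }
}
```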



[jira] [Commented] (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2011-04-29 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027104#comment-13027104
 ] 

Hairong Kuang commented on HDFS-101:


From a quick look at your client-side log, it seems that the datanode sent an 
ack with two fields while the client expected an ack with three fields. Could 
you please create a jira and post the logs there? Thanks!



[jira] [Commented] (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2011-04-29 Thread Hari (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027003#comment-13027003
 ] 

Hari commented on HDFS-101:
---

TestFileAppend4 is failing randomly even with this patch on the 20-append 
branch.

We start 3 datanodes in the MiniDfsCluster. When the second datanode is killed, 
the first datanode correctly sends the ack SUCCESS (for itself) FAILED (2nd dn) 
to the client. But the client gets an EOFException while reading this ack from 
the first datanode and hence incorrectly assumes the first datanode is bad. As 
a result, the first datanode (which is fine) is removed from the pipeline, and 
the 2nd datanode will also be removed during recovery. Now only 1 replica (the 
3rd datanode) remains for the block, and the assertion fails.

Since the 1st datanode is fine and sends the ack correctly to the client, it is 
not clear why the client gets an EOFException. Is this normal?

The relevant part of the log is : 
"..
2011-04-27 14:48:16,796 DEBUG hdfs.DFSClient (DFSClient.java:run(2445)) - 
DataStreamer block blk_-1353652352417823279_1001 wrote packet seqno:-1 size:25 
offsetInBlock:0 lastPacketInBlock:false
2011-04-27 14:48:16,796 DEBUG datanode.DataNode (BlockReceiver.java:run(891)) - 
PacketResponder 2 for block blk_-1353652352417823279_1001 responded an ack: 
Replies for seqno -1 are SUCCESS FAILED
2011-04-27 14:48:16,796 WARN  hdfs.DFSClient (DFSClient.java:run(2596)) - 
DFSOutputStream ResponseProcessor exception  for block 
blk_-1353652352417823279_1001java.io.EOFException
at java.io.DataInputStream.readShort(Unknown Source)
at 
org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:125)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2548)

2011-04-27 14:48:16,796 DEBUG datanode.DataNode (BlockReceiver.java:run(789)) - 
PacketResponder 2 seqno = -2 for block blk_-1353652352417823279_1001 waiting 
for local datanode to finish write.
2011-04-27 14:48:16,796 WARN  hdfs.DFSClient 
(DFSClient.java:processDatanodeError(2632)) - Error Recovery for block 
blk_-1353652352417823279_1001 bad datanode[0] 127.0.0.1:4956
"



[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-06-16 Thread Nicolas Spiegelberg (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879439#action_12879439
 ] 

Nicolas Spiegelberg commented on HDFS-101:
--

Todd, your assumption is correct.  I needed a couple of small things from the 
HDFS-793 patch (namely, getNumOfReplies) to make HDFS-101 compatible with 
HDFS-872.
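A length-prefixed ack in the spirit of the getNumOfReplies accessor mentioned here removes the guesswork entirely: the writer records how many replies follow, and readFields reads exactly that many. This is a hedged sketch — the field names and layout are illustrative, not the actual HDFS PipelineAck wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class LengthPrefixedAck {
    private long seqno;
    private short[] replies;

    LengthPrefixedAck() {}

    LengthPrefixedAck(long seqno, short[] replies) {
        this.seqno = seqno;
        this.replies = replies;
    }

    int getNumOfReplies() {
        return replies.length;
    }

    void write(DataOutput out) throws IOException {
        out.writeLong(seqno);
        out.writeShort(replies.length); // count first, so readers never guess
        for (short r : replies) {
            out.writeShort(r);
        }
    }

    void readFields(DataInput in) throws IOException {
        seqno = in.readLong();
        int n = in.readShort(); // read exactly as many replies as were sent
        replies = new short[n];
        for (int i = 0; i < n; i++) {
            replies[i] = in.readShort();
        }
    }

    // Round-trips a 2-reply ack through the wire format.
    static String roundTrip() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new LengthPrefixedAck(-1L, new short[] {0, 1})
                    .write(new DataOutputStream(buf));
            LengthPrefixedAck ack = new LengthPrefixedAck();
            ack.readFields(new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())));
            return ack.getNumOfReplies() + " replies: " + Arrays.toString(ack.replies);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip()); // 2 replies: [0, 1]
    }
}
```

Because the count travels with the ack, a pipeline that shrinks from 3 datanodes to 2 mid-write no longer causes the client to over-read the stream.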




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-06-08 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876800#action_12876800
 ] 

Todd Lipcon commented on HDFS-101:
--

Hey Nicolas. Can you clarify what you mean by HDFS-793 no longer being 
necessary? You mean that it's not necessary since we already have HDFS-872 
applied to the branch? I agree with that. I also agree that these patches 
maintain wire compatibility.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-03-20 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847869#action_12847869
 ] 

Hairong Kuang commented on HDFS-101:


> is this latest patch applicable for branch-20 as well
I do not think that it applies to 0.20. I opened a different jira for 
committing HDFS-101 to 0.20. I will work on it when I have time.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-03-20 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847848#action_12847848
 ] 

Todd Lipcon commented on HDFS-101:
--

Hey Hairong - is this latest patch applicable for branch-20 as well or is it 
unique to the way that HDFS-101 made it into ydist? (I haven't had the time to 
look at it in detail)




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-01-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802582#action_12802582
 ] 

Hudson commented on HDFS-101:
-

Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #196 (See 
[http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/196/])





[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-01-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802192#action_12802192
 ] 

Hudson commented on HDFS-101:
-

Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #99 (See 
[http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/99/])
Move the change logs of HDFS-793 and  from 0.20 section to 0.21 section





[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-01-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801169#action_12801169
 ] 

Hudson commented on HDFS-101:
-

Integrated in Hadoop-Hdfs-trunk #202 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/202/])
Move the change logs of HDFS-793 and  from 0.20 section to 0.21 section





[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-01-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800855#action_12800855
 ] 

Hudson commented on HDFS-101:
-

Integrated in Hadoop-Hdfs-trunk-Commit #172 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/172/])
Move the change logs of HDFS-793 and  from 0.20 section to 0.21 section








[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2010-01-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798992#action_12798992
 ] 

Hudson commented on HDFS-101:
-

Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #94 (See 
[http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/94/])





[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794610#action_12794610
 ] 

Hudson commented on HDFS-101:
-

Integrated in Hadoop-Hdfs-trunk #182 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/182/])





[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794348#action_12794348
 ] 

Hudson commented on HDFS-101:
-

Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #159 (See 
[http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/159/])





[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793393#action_12793393
 ] 

Hudson commented on HDFS-101:
-

Integrated in Hadoop-Hdfs-trunk-Commit #152 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/152/])
HDFS-101. DFS write pipeline: DFSClient sometimes does not detect second datanode 
failure. Contributed by Hairong Kuang.





[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-21 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793372#action_12793372
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-101:
-

Hi Todd, thank you for testing it.  Could you post the log of DN .82 which 
shows the DiskOutOfSpaceException stack trace in [your test 
cases|https://issues.apache.org/jira/browse/HDFS-101?focusedCommentId=12792245&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12792245]?
  We may add new unit tests for your test cases.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-18 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792677#action_12792677
 ] 

dhruba borthakur commented on HDFS-101:
---

+1 Code looks good. The unit test will be difficult to write. I will look at 
the unit test when you post it.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-18 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792572#action_12792572
 ] 

Hairong Kuang commented on HDFS-101:


> Do you know of a good way to manually trigger this?
Increasing a file's replication factor from 1 to 3 will trigger this. But I do 
not think my patch changes any replication behavior, because an IOException is 
not thrown in the case of block replication.
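Hairong's point can be sketched as follows. This is a hypothetical model, not code from the patch: the class and method names (`MirrorErrorSketch`, `handleMirrorFailure`) are invented for illustration. The idea is that only a write carrying a non-empty client name propagates the downstream failure, so a replication transfer is unaffected.

```java
import java.io.IOException;

// Hypothetical sketch of the behavior described above; all names are
// invented for illustration and do not come from the HDFS-101 patch.
public class MirrorErrorSketch {
    // On a client write (non-empty clientName), a failure of the
    // downstream mirror is surfaced so DFSClient can eject the right
    // datanode; on a DataNode-initiated replication transfer (empty
    // clientName) it is only logged, leaving replication behavior as-is.
    static String handleMirrorFailure(String clientName, IOException cause) {
        boolean clientWrite = clientName != null && !clientName.isEmpty();
        if (clientWrite) {
            return "propagate: " + cause.getMessage();
        }
        return "log-only";
    }

    public static void main(String[] args) {
        // Client write: the downstream failure is surfaced.
        System.out.println(handleMirrorFailure("DFSClient_42", new IOException("dn2 down")));
        // Replication transfer: swallowed.
        System.out.println(handleMirrorFailure("", new IOException("dn2 down")));
    }
}
```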




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-18 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792473#action_12792473
 ] 

dhruba borthakur commented on HDFS-101:
---

You are right: transferBlocks can take multiple targets. That means 
clientName.len can be zero while, at the same time, the mirror is non-null. 
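A minimal sketch of the case described here, with names modeled loosely on DataNode's transferBlocks/BlockReceiver but invented for illustration (this is not the real code): a replication transfer carries an empty client name, yet with multiple remaining targets the receiving datanode still has a downstream mirror.

```java
// Hypothetical sketch; names are illustrative, not HDFS fields.
public class TransferSketch {
    // An empty client name marks a DataNode-initiated replication
    // transfer rather than a client write.
    static boolean isReplicationTransfer(String clientName) {
        return clientName.isEmpty();
    }

    // downstreamTargets are the datanodes this node must still forward
    // to; if any remain, the first is this node's mirror. So a
    // replication transfer (clientName.len == 0) can still have a
    // non-null mirror when transferBlocks hands it multiple targets.
    static String mirrorOf(String[] downstreamTargets) {
        return downstreamTargets.length > 0 ? downstreamTargets[0] : null;
    }

    public static void main(String[] args) {
        String[] targets = { "10.251.43.82:50010", "10.251.66.212:50010" };
        System.out.println(isReplicationTransfer(""));
        System.out.println(mirrorOf(targets));
    }
}
```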




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792375#action_12792375
 ] 

Todd Lipcon commented on HDFS-101:
--

Ah, I never realized that transferBlocks could take multiple targets.

Do you know of a good way to manually trigger this? It would be good to verify 
that it's still working well in branch-20 after this patch, if it's not too 
hard. (Or is it covered by unit tests?)




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-17 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792373#action_12792373
 ] 

Hairong Kuang commented on HDFS-101:


Todd and Dhruba, thanks a lot for your testing and review.

> replication request always have a pipeline of size 1.
I do not think this is true. A replication pipeline can have a size greater 
than 1, so it is possible that client length == 0 but mirror != null.





[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-17 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792358#action_12792358
 ] 

dhruba borthakur commented on HDFS-101:
---

I am looking at the patch too. Looks good at first sight.

> clientName.len == 0 means that this is a block copy for replication. It has 
> nothing to do if this is the last DN in pipeline or not.

I agree. To elaborate, replication requests always have a pipeline of size 1. 
That means there isn't any mirror in this case.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792287#action_12792287
 ] 

Hadoop QA commented on HDFS-101:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428366/detectDownDN1.patch
  against trunk revision 891593.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/150/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/150/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/150/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/150/console

This message is automatically generated.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792285#action_12792285
 ] 

Todd Lipcon commented on HDFS-101:
--

Just applied 
https://issues.apache.org/jira/secure/attachment/12428383/detectDownDN1-0.20.patch
 and tested on the cluster. I think the other error I mentioned above is just 
HDFS-630, since I'm testing on 0.20 on a 3-node cluster, so +1 on this patch.

bq. clientName.len == 0 means that this is a block copy for replication. It has 
nothing to do if this is the last DN in pipeline or not.

Right, but my question is whether clientName.len can ever be 0 when there's a 
mirror. My belief is no. Perhaps it's worth an assert there (since we're now 
cool with assertions in HDFS).
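The assert Todd has in mind might look like the following hypothetical sketch (the class and parameter names are invented). Note that Hairong's reply elsewhere in this thread shows the invariant does not actually hold for multi-target replication transfers, so this is shown only to make the question concrete.

```java
// Hypothetical sketch of the proposed assertion; illustrative only.
public class InvariantSketch {
    // Proposed invariant: a non-null mirror implies a non-empty client
    // name. Per the transferBlocks discussion in this thread, the
    // invariant can be violated, so it should not be asserted as-is.
    static boolean invariantHolds(String clientName, String mirrorAddr) {
        return mirrorAddr == null || !clientName.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(invariantHolds("DFSClient_1", "10.251.43.82:50010"));
        System.out.println(invariantHolds("", "10.251.43.82:50010"));
    }
}
```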




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792275#action_12792275
 ] 

Todd Lipcon commented on HDFS-101:
--

err, sorry, correction to above. I forcibly killed the DN on *10.251.66.212*. 
So the detection of down node was correct, it was just failure recovery that 
was problematic. This might be related to HDFS-630.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792274#action_12792274
 ] 

Todd Lipcon commented on HDFS-101:
--

As a second test of the above modification, I started uploading a 1G file, then 
forcibly killed the DN on 10.250.7.148:

09/12/17 20:14:53 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor 
exception  for block blk_-8026763677133524198_1407java.io.IOException: Bad 
response 1 for block blk_-8026763677133524198_1407 from datanode 
10.251.66.212:50010
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2427)

09/12/17 20:14:53 WARN hdfs.DFSClient: Error Recovery for block 
blk_-8026763677133524198_1407 bad datanode[2] 10.251.66.212:50010
09/12/17 20:14:53 WARN hdfs.DFSClient: Error Recovery for block 
blk_-8026763677133524198_1407 in pipeline 10.250.7.148:50010, 
10.251.43.82:50010, 10.251.66.212:50010: bad datanode 10.251.66.212:50010
09/12/17 20:14:54 INFO hdfs.DFSClient: Exception in createBlockOutputStream 
java.io.IOException: Bad connect ack with firstBadLink 10.251.66.212:50010
09/12/17 20:14:54 INFO hdfs.DFSClient: Abandoning block 
blk_-3750676278765626865_1408
09/12/17 20:15:00 INFO hdfs.DFSClient: Exception in createBlockOutputStream 
java.io.IOException: Bad connect ack with firstBadLink 10.251.66.212:50010
09/12/17 20:15:00 INFO hdfs.DFSClient: Abandoning block blk_7561780221358446528_1408
09/12/17 20:15:06 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.251.66.212:50010
09/12/17 20:15:06 INFO hdfs.DFSClient: Abandoning block blk_-8059177057921476468_1408
09/12/17 20:15:12 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused
09/12/17 20:15:12 INFO hdfs.DFSClient: Abandoning block blk_-8264633252613228869_1408
09/12/17 20:15:18 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2818)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)

09/12/17 20:15:18 WARN hdfs.DFSClient: Error Recovery for block blk_-8264633252613228869_1408 bad datanode[0] nodes == null
09/12/17 20:15:18 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/root/1261098884" - Aborting...
put: Connection refused
09/12/17 20:15:18 ERROR hdfs.DFSClient: Exception closing file /user/root/1261098884 : java.net.ConnectException: Connection refused
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2843)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2799)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)

As you can see above, it correctly detected the down DN. But the second block 
of the file failed to write (the file left on HDFS at the end was exactly 
128M). fsck -openforwrite shows that the file is still open:

OPENFORWRITE: ./user/root/1261098884 134217728 bytes, 1 block(s), OPENFORWRITE:
/user/root/1261098884:  Under replicated blk_-8026763677133524198_1408. Target Replicas is 3 but found 2 replica(s).


> DFS write pipeline : DFSClient sometimes does not detect second datanode 
> failure 
> -
>
> Key: HDFS-101
> URL: https://issues.apache.org/jira/browse/HDFS-101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: Raghu Angadi
>Assignee: Hairong Kuang
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: detectDownDN-0.20.patch, detectDownDN.patch, 
> detectDownDN1-0.20.patch, detectDownDN1.patch, hdfs-101.tar.gz
>
>
> When the first datanode's write to second datanode fails or times out 
> DFSClient ends up marking first datanode as the bad one and removes it from 
> the pipeline. Similar problem exists on DataNode as well and it is fixed in 
> HADOOP-3339. From HADOOP-3339 : 
> "The main issue is that BlockReceiver thread (and DataStreamer in the case of 
> DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty 
> coarse control. We don't know what state the responder is in and interrupting 
> has different effects depending on responder state. To fix this properly we 
> need to redesign how we handle these interactions."
> When the first datanode closes its socket from DFSClient, DFSClient should 
> properly read all the data left in the socket.. Also, DataNode's closing of 
> the socket should not result in a TCP reset, otherwise I think DFSClient will 
> not be able to read from the socket.

[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792261#action_12792261
 ] 

Todd Lipcon commented on HDFS-101:
--

I just tried changing that if statement to == 0 instead of > 0, and it seems to 
have fixed the bug for me. I reran the above test and it successfully ejected 
the full node:

09/12/17 19:59:21 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-1132852588861426806_1405
java.io.IOException: Bad response 1 for block blk_-1132852588861426806_1405 from datanode 10.251.43.82:50010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2427)

09/12/17 19:59:21 WARN hdfs.DFSClient: Error Recovery for block blk_-1132852588861426806_1405 bad datanode[1] 10.251.43.82:50010
09/12/17 19:59:21 WARN hdfs.DFSClient: Error Recovery for block blk_-1132852588861426806_1405 in pipeline 10.250.7.148:50010, 10.251.43.82:50010, 10.251.66.212:50010: bad datanode 10.251.43.82:50010

Is it possible that clientName.length() would ever be equal to 0 in 
handleMirrorOutError? I'm new to this area of the code, so I may be missing 
something, but as I understand it, clientName.length() is 0 only for 
inter-datanode replication requests. For those writes, I didn't think any 
pipelining was done.
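
The condition under discussion can be sketched as follows. This is a hypothetical, simplified stand-in for the check inside handleMirrorOutError, not the actual DataNode source: the receiver would treat a mirror-write error as fatal to itself only when no client is driving the pipeline, i.e. when clientName is empty (an inter-datanode replication request).

```java
// Hypothetical sketch of the corrected check; names are illustrative.
public class MirrorErrorCheckSketch {
    // An empty clientName marks an inter-datanode replication request,
    // where no DFSClient exists to run pipeline recovery.
    public static boolean stopOnMirrorError(String clientName) {
        return clientName.length() == 0;
    }
}
```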

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792256#action_12792256
 ] 

Todd Lipcon commented on HDFS-101:
--

Come to think of it, it can never be the last node in the pipeline inside 
handleMirrorOutError, since there's no mirror out to have errors (duh!). But, 
not sure why that if statement exists at all, then. An error writing to the 
mirror indicates a problem with the mirror more than a problem with yourself.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792248#action_12792248
 ] 

Todd Lipcon commented on HDFS-101:
--

Looking at the patch, I'm confused by the logic in handleMirrorOutError. It 
seems to me that the if statement there checking for a non-empty clientName 
should actually be checking whether it's the last node in the pipeline. If it's 
the first node in the pipeline, shouldn't it propagate the error backward just 
as if it received the error while receiving an ack?




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-17 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792114#action_12792114
 ] 

Todd Lipcon commented on HDFS-101:
--

Any chance that you have a patch available for branch-20 as well? The cluster 
where I can reliably reproduce this is running 0.20.1, so would like to test 
there as well as looking at it on trunk.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-16 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791804#action_12791804
 ] 

Hairong Kuang commented on HDFS-101:


The answer is yes. Thanks for your help testing it, Todd.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-16 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791778#action_12791778
 ] 

Todd Lipcon commented on HDFS-101:
--

Hi Hairong,

Do you anticipate that this will also solve HDFS-795? If so I'll try the patch 
on my test cluster tomorrow.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-09 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788202#action_12788202
 ] 

Hairong Kuang commented on HDFS-101:


When I thought more about it, it may make sense to let the client decide how to 
handle the case when DNi has a communication problem with DNi+1, because the 
client is the one who decides the pipeline recovery policy. DNi itself has no 
problem receiving packets and storing them to disk except that it cannot 
talk to DNi+1. I think in this case it is OK to let DNi continue to run.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-08 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787880#action_12787880
 ] 

Kan Zhang commented on HDFS-101:


> continue to run until DNi-1 or the client closes the connection.
You may not want to make the server (DN) depend on the behavior of the client 
(DFSClient).




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-07 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787177#action_12787177
 ] 

Hairong Kuang commented on HDFS-101:


To be more specific: if the block receiver gets an error sending a packet to 
DNi+1, it still queues the packet in the ack queue, but with a flag 
"mirrorError" set to true, indicating that the packet failed to mirror to 
DNi+1. The block receiver continues to write the packet to disk and then 
handles the next packet. The packet responder does not exit when it detects 
that DNi+1 has an error.
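
The queue-with-flag idea above can be sketched as follows. Class and field names are illustrative only, not the actual BlockReceiver/PacketResponder code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the mirrorError scheme described above.
public class AckQueueSketch {
    public static final class Packet {
        public final long seqno;
        public final boolean mirrorError; // forwarding to DNi+1 failed

        public Packet(long seqno, boolean mirrorError) {
            this.seqno = seqno;
            this.mirrorError = mirrorError;
        }
    }

    private final Deque<Packet> ackQueue = new ArrayDeque<>();

    // Block receiver side: queue the packet for acking even if mirroring
    // failed, then keep writing packets to disk.
    public void enqueue(long seqno, boolean mirrorFailed) {
        ackQueue.addLast(new Packet(seqno, mirrorFailed));
    }

    // Packet responder side: report the downstream failure in the ack
    // instead of exiting, so the client can eject DNi+1 rather than
    // blaming DNi.
    public String ackStatus() {
        Packet p = ackQueue.removeFirst();
        return p.mirrorError ? "seqno=" + p.seqno + " MIRROR_ERROR"
                             : "seqno=" + p.seqno + " SUCCESS";
    }
}
```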






[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-07 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787137#action_12787137
 ] 

Hairong Kuang commented on HDFS-101:


Would this work? When DNi detects an error communicating with DNi+1, it 
sends an ack indicating that DNi+1 failed and continues to run until DNi-1 or 
the client closes the connection.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-07 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787120#action_12787120
 ] 

Hairong Kuang commented on HDFS-101:


> When the first datanode closes its socket from DFSClient, DFSClient should 
> properly read all the data left in the socket..
Kan, thanks for pointing this out; it is a very valid point. I think this 
applies to the datanodes in the pipeline as well.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-07 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787047#action_12787047
 ] 

Kan Zhang commented on HDFS-101:


> here is the plan for handling errors detected by DNi
This was the approach I took in HDFS-564. However, the problem reported in this 
JIRA was still seen even after I made the change in h564-24.patch. I suspect 
the key problem here is the following (as described in the description) and 
it's orthogonal to how a DN reports its downstream errors.
  - When the first datanode closes its socket from DFSClient, DFSClient should 
properly read all the data left in the socket.. Also, DataNode's closing of the 
socket should not result in a TCP reset, otherwise I think DFSClient will not 
be able to read from the socket.
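
The TCP-reset point is worth making concrete: on most stacks, calling close() while unread bytes remain in the receive buffer provokes an RST rather than a clean FIN, and the peer's subsequent reads then fail with "Connection reset". A hedged sketch of a graceful close that avoids this (illustrative only, not what DataNode or DFSClient does today):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

// Illustrative sketch: shut down the write side first, drain whatever the
// peer has already sent, then close. Closing with unread data pending is
// what can turn into a TCP RST and an unreadable socket on the other end.
public class GracefulCloseSketch {
    public static void closeGracefully(Socket s) throws IOException {
        s.shutdownOutput();                 // send FIN; no more writes
        InputStream in = s.getInputStream();
        byte[] buf = new byte[4096];
        while (in.read(buf) != -1) {
            // discard: we only need the receive buffer emptied
        }
        s.close();
    }
}
```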




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-07 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787030#action_12787030
 ] 

Hairong Kuang commented on HDFS-101:


Assume that there is a pipeline consisting of DN0, ..., DNi, ..., where DN0 is 
the closest to the client. Here is the plan for handling errors detected by DNi:
1. If the error occurs when communicating with DNi+1, send an ack indicating 
that DNi+1 failed and then shut down both the block receiver and the ack 
responder.
2. If the error is caused by DNi itself, simply shut down both the block 
receiver and the ack responder. Shutting down the block receiver closes the 
connection to DNi-1, so DNi-1 detects immediately that DNi has failed.
3. If the error is caused by DNi-1, handle it the same as 2.

Errors may be detected by either the block receiver or the ack responder. 
Whichever one detects the error must notify the other, so that it too stops 
and shuts itself down.
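
The three-way classification above can be sketched as follows (hypothetical names; the real logic is spread across the BlockReceiver and its responder thread):

```java
// Hypothetical sketch of the error-handling plan above.
public class PipelineErrorSketch {
    public enum Source { DOWNSTREAM, LOCAL, UPSTREAM } // DNi+1, DNi, DNi-1

    public static String handle(Source src) {
        if (src == Source.DOWNSTREAM) {
            // Case 1: ack DNi+1's failure upstream first, then stop.
            return "ack DNi+1 failure, then shut down receiver and responder";
        }
        // Cases 2 and 3: just shut down; closing the connection to DNi-1
        // lets the upstream node detect DNi's failure immediately.
        return "shut down receiver and responder";
    }
}
```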




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-12-07 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787019#action_12787019
 ] 

Hairong Kuang commented on HDFS-101:


> Is this the same as HDFS-795? 
Yes, the only difference is that HDFS-795 describes the problem in a more 
general way.




[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-11-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783952#action_12783952
 ] 

Todd Lipcon commented on HDFS-101:
--

Is this the same as HDFS-795?





[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-11-30 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783858#action_12783858
 ] 

Hairong Kuang commented on HDFS-101:


This is not an easy problem to solve. I created HDFS-793 as a first step 
toward a solution. I will elaborate on my plan here while working on HDFS-793.





[jira] Commented: (HDFS-101) DFS write pipeline : DFSClient sometimes does not detect second datanode failure

2009-11-25 Thread Hairong Kuang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782669#action_12782669
 ] 

Hairong Kuang commented on HDFS-101:


It seems to me there are two issues here:
1. If a datanode gets an error while receiving a block, it should not simply 
stop itself. Instead, it should send a failure ack back.
2. The datanode should also identify the source of the error, whether it was 
caused by itself or by another datanode in the pipeline, and report that 
information back in the ack.
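The two points above amount to an ack that carries per-datanode status, so the client can tell which node actually failed instead of blaming whichever one closed the connection. Below is a hypothetical sketch in that spirit; the class name `PipelineAck` and its methods are chosen for illustration here, not taken from the Hadoop source.

```java
import java.util.Arrays;

public class PipelineAck {
    enum Status { SUCCESS, ERROR }

    private final Status[] replies; // one status per datanode in the pipeline

    PipelineAck(Status[] replies) {
        this.replies = replies.clone();
    }

    // A datanode that hits a local error marks its own slot as ERROR and
    // sends the ack upstream, instead of silently dropping the connection.
    static PipelineAck localError(int myIndex, int pipelineLen) {
        Status[] replies = new Status[pipelineLen];
        Arrays.fill(replies, Status.SUCCESS);
        replies[myIndex] = Status.ERROR;
        return new PipelineAck(replies);
    }

    // The client scans the ack for the first failed datanode, so it can
    // remove the right node from the pipeline on recovery.
    int firstBadNode() {
        for (int i = 0; i < replies.length; i++) {
            if (replies[i] == Status.ERROR) {
                return i;
            }
        }
        return -1; // all datanodes succeeded
    }

    public static void main(String[] args) {
        // Second datanode (index 1) of a 3-node pipeline fails:
        PipelineAck ack = PipelineAck.localError(1, 3);
        System.out.println(ack.firstBadNode()); // prints 1
    }
}
```

With an ack like this, the failure reported to DFSClient names the node that actually erred, which is exactly the information missing in the scenario this issue describes.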

