[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2019-07-09 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881469#comment-16881469
 ] 

Yongjun Zhang commented on HDFS-9178:
-

Hi [~kihwal], many thanks for the work here! 

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.8.0, 2.7.2, 2.6.4, 3.0.0-alpha1
>
> Attachments: 002-HDFS-9178.branch-2.6.patch, 
> HDFS-9178.branch-2.6.patch, HDFS-9178.patch
>
>
> When a non-leaf datanode in a pipeline is slow on or stuck at disk I/O, the 
> downstream node can time out on reading a packet, since even the heartbeat 
> packets will not be relayed down.
> The packet read timeout is set in {{DataXceiver#run()}}:
> {code}
>   peer.setReadTimeout(dnConf.socketTimeout);
> {code}
> When the downstream node times out and closes the connection to the upstream 
> node, the upstream node's {{PacketResponder}} gets an {{EOFException}} and 
> sends an ack upstream with the downstream node's status set to {{ERROR}}. 
> This causes the client to exclude the downstream node, even though the 
> upstream node was the one that got stuck.
> The connection to the downstream node has a longer timeout, so the downstream 
> node will always time out first. The downstream timeout is set in 
> {{writeBlock()}}:
> {code}
>   int timeoutValue = dnConf.socketTimeout +
>       (HdfsConstants.READ_TIMEOUT_EXTENSION * targets.length);
>   int writeTimeout = dnConf.socketWriteTimeout +
>       (HdfsConstants.WRITE_TIMEOUT_EXTENSION * targets.length);
>   NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
>   OutputStream unbufMirrorOut = NetUtils.getOutputStream(mirrorSock,
>       writeTimeout);
> {code}
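
To make the arithmetic concrete, here is a minimal, self-contained sketch of 
the timeout computation quoted above. It assumes a 60 s socket timeout and a 
5 s per-target read-timeout extension; both values, the class, and its 
constants are illustrative stand-ins for {{dnConf.socketTimeout}} and 
{{HdfsConstants.READ_TIMEOUT_EXTENSION}}, so check the constants in your tree.
{code}
// Illustrative sketch, not DataNode code: shows why the most downstream
// node always times out first when a middle node stalls on disk I/O.
public class PipelineTimeoutSketch {
  // Assumed values, stand-ins for dnConf.socketTimeout and
  // HdfsConstants.READ_TIMEOUT_EXTENSION.
  static final int SOCKET_TIMEOUT_MS = 60_000;
  static final int READ_TIMEOUT_EXTENSION_MS = 5_000;

  public static void main(String[] args) {
    int pipelineLen = 3; // client -> DN0 -> DN1 -> DN2
    for (int i = 0; i < pipelineLen; i++) {
      int targetsLeft = pipelineLen - 1 - i; // targets.length at node i
      // Each node reads packets from its upstream with the plain timeout
      // (DataXceiver#run), but connects to its mirror with an extended
      // one (writeBlock), so the leaf's packet-read timer fires first.
      int packetReadTimeoutMs = SOCKET_TIMEOUT_MS;
      int mirrorTimeoutMs =
          SOCKET_TIMEOUT_MS + READ_TIMEOUT_EXTENSION_MS * targetsLeft;
      System.out.printf("DN%d: packet read %,d ms, mirror %,d ms%n",
          i, packetReadTimeoutMs, mirrorTimeoutMs);
    }
  }
}
{code}
With three nodes and these values, DN0 waits up to 70 s on its mirror while 
DN2's packet read fires at 60 s, so a stall in the middle of the pipeline gets 
blamed on the healthy leaf node.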



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2016-02-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128715#comment-15128715
 ] 

Junping Du commented on HDFS-9178:
--

Thanks [~kihwal] for reviewing the patch! I have committed the patch to branch-2.6.

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2, 2.6.4
>
> Attachments: 002-HDFS-9178.branch-2.6.patch, 
> HDFS-9178.branch-2.6.patch, HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2016-02-02 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128657#comment-15128657
 ] 

Kihwal Lee commented on HDFS-9178:
--

+1 It looks like the only difference is in {{DataNodeFaultInjector}}. 
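
For context, {{DataNodeFaultInjector}} follows the usual no-op-hook pattern: 
production code calls a hook that does nothing by default, and a test installs 
an override to simulate the slow I/O. A minimal sketch of the pattern follows; 
the class body and the hook name {{stopSendingPacket()}} are illustrative, not 
the patch's actual code.
{code}
import java.io.IOException;

// Illustrative sketch of the no-op-hook fault-injector pattern.
class FaultInjectorSketch {
  private static FaultInjectorSketch instance = new FaultInjectorSketch();
  static FaultInjectorSketch get() { return instance; }
  static void set(FaultInjectorSketch injector) { instance = injector; }

  // Production code calls this at the interesting point; the default is a
  // no-op, so there is no cost outside of tests.
  void stopSendingPacket() throws IOException {}

  public static void main(String[] args) throws IOException {
    // A test installs an override that simulates a datanode stuck on I/O.
    set(new FaultInjectorSketch() {
      @Override
      void stopSendingPacket() throws IOException {
        try {
          Thread.sleep(1_000); // stand-in for a long disk stall
        } catch (InterruptedException e) {
          throw new IOException(e);
        }
      }
    });
    get().stopSendingPacket(); // the instrumented code path hits the hook
  }
}
{code}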

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: 002-HDFS-9178.branch-2.6.patch, 
> HDFS-9178.branch-2.6.patch, HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2016-01-03 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080547#comment-15080547
 ] 

Junping Du commented on HDFS-9178:
--

Hi [~kihwal], I see you have already attached the patch for the 2.6 branch. 
Shall we commit this patch to branch-2.6? Thanks!

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-9178.branch-2.6.patch, HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-11-20 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019094#comment-15019094
 ] 

Sangjin Lee commented on HDFS-9178:
---

Does this issue exist in 2.6.x? Should this be backported to branch-2.6?

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-9178.branch-2.6.patch, HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-10-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947023#comment-14947023
 ] 

Hudson commented on HDFS-9178:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8587 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8587/])
HDFS-9178. Slow datanode I/O can cause a wrong node to be marked bad. (kihwal: 
rev 99e5204ff5326430558b6f6fd9da7c44654c15d7)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestClientProtocolForPipelineRecovery.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNodeFaultInjector.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java


> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Attachments: HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-10-07 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946988#comment-14946988
 ] 

Daryn Sharp commented on HDFS-9178:
---

+1 Seems to have helped fix our problems.

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Attachments: HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-10-07 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947041#comment-14947041
 ] 

Kihwal Lee commented on HDFS-9178:
--

Thanks for the review, Daryn. I've committed this to trunk, branch-2 and 
branch-2.7.

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-10-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947122#comment-14947122
 ] 

Hudson commented on HDFS-9178:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1231 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1231/])
HDFS-9178. Slow datanode I/O can cause a wrong node to be marked bad. (kihwal: 
rev 99e5204ff5326430558b6f6fd9da7c44654c15d7)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestClientProtocolForPipelineRecovery.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNodeFaultInjector.java


> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-9178.branch-2.6.patch, HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-10-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947384#comment-14947384
 ] 

Hudson commented on HDFS-9178:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #467 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/467/])
HDFS-9178. Slow datanode I/O can cause a wrong node to be marked bad. (kihwal: 
rev 99e5204ff5326430558b6f6fd9da7c44654c15d7)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestClientProtocolForPipelineRecovery.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNodeFaultInjector.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-9178.branch-2.6.patch, HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-10-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947405#comment-14947405
 ] 

Hudson commented on HDFS-9178:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2405 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2405/])
HDFS-9178. Slow datanode I/O can cause a wrong node to be marked bad. (kihwal: 
rev 99e5204ff5326430558b6f6fd9da7c44654c15d7)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNodeFaultInjector.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestClientProtocolForPipelineRecovery.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java


> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-9178.branch-2.6.patch, HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-10-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947280#comment-14947280
 ] 

Hudson commented on HDFS-9178:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2437 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2437/])
HDFS-9178. Slow datanode I/O can cause a wrong node to be marked bad. (kihwal: 
rev 99e5204ff5326430558b6f6fd9da7c44654c15d7)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestClientProtocolForPipelineRecovery.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNodeFaultInjector.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-9178.branch-2.6.patch, HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-10-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947323#comment-14947323
 ] 

Hudson commented on HDFS-9178:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #494 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/494/])
HDFS-9178. Slow datanode I/O can cause a wrong node to be marked bad. (kihwal: 
rev 99e5204ff5326430558b6f6fd9da7c44654c15d7)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNodeFaultInjector.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestClientProtocolForPipelineRecovery.java


> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-9178.branch-2.6.patch, HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-10-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947195#comment-14947195
 ] 

Hudson commented on HDFS-9178:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #502 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/502/])
HDFS-9178. Slow datanode I/O can cause a wrong node to be marked bad. (kihwal: 
rev 99e5204ff5326430558b6f6fd9da7c44654c15d7)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNodeFaultInjector.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestClientProtocolForPipelineRecovery.java


> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-9178.branch-2.6.patch, HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-10-01 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939789#comment-14939789
 ] 

Kihwal Lee commented on HDFS-9178:
--

- release audit: caused by the EC branch merge.
- checkstyle: file length, which was already over the "limit".
- test failures: mostly new EC-related tests. They seem to pass when run 
locally, including {{TestLazyWriter}}. 

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Attachments: HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-09-30 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938762#comment-14938762
 ] 

Kihwal Lee commented on HDFS-9178:
--

A simple solution is to let the datanode check when it last sent a packet 
whenever the downstream closes the connection. If it has not sent a packet for 
a long time (e.g. 0.9 * timeout; it is supposed to send a packet at least 
every 0.5 * timeout), it or its upstream might be at fault. In that case, it 
simply closes the connection to its upstream, so that the same check is 
triggered upstream. If an upstream node thinks it has sent packets in time, 
the downstream node will be reported as bad. When this propagates all the way 
to the client, the client removes the first node and rebuilds the pipeline. 
Since {{DataStreamer}} does not get stuck on disk I/O (except on the rare 
occasion when it logs and the disk is having an issue), the cause would be 
either a slow first node or a communication problem between the client and the 
first node, so removing the first node seems reasonable.
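
Below is a minimal sketch of that check; the field and method names are 
stand-ins, not the identifiers used in the actual {{BlockReceiver}} patch.
{code}
import java.io.IOException;
import java.net.Socket;

// Illustrative sketch of the proposed check, not the committed code.
class SlowNodeCheckSketch {
  private volatile long lastPacketSentTimeMs = System.currentTimeMillis();
  private final long timeoutMs;       // the downstream read timeout
  private final Socket upstreamSock;  // connection back toward the client

  SlowNodeCheckSketch(long timeoutMs, Socket upstreamSock) {
    this.timeoutMs = timeoutMs;
    this.upstreamSock = upstreamSock;
  }

  // Called every time a packet (or heartbeat) is relayed downstream.
  void onPacketSentDownstream() {
    lastPacketSentTimeMs = System.currentTimeMillis();
  }

  // Called when the downstream node closes the connection (EOFException).
  void onDownstreamEof() throws IOException {
    long silentMs = System.currentTimeMillis() - lastPacketSentTimeMs;
    if (silentMs > 0.9 * timeoutMs) {
      // We were silent too long (a packet or heartbeat should go out at
      // least every 0.5 * timeout), so this node or its upstream is the
      // likely culprit: close toward the client so the same check runs
      // upstream, instead of reporting the downstream node as ERROR.
      upstreamSock.close();
    }
    // Otherwise packets went out in time; the downstream node really is
    // bad and is reported as ERROR in the ack.
  }
}
{code}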

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Priority: Critical



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2015-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938946#comment-14938946
 ] 

Hadoop QA commented on HDFS-9178:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m 37s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  1s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 59s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m 10s | There were no new javadoc 
warning messages. |
| {color:red}-1{color} | release audit |   0m 15s | The applied patch generated 
1 release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 24s | The applied patch generated  1 
new checkstyle issues (total was 61, now 61). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 31s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native |   3m 11s | Pre-build of native portion |
| {color:red}-1{color} | hdfs tests |  64m 32s | Tests failed in hadoop-hdfs. |
| | | 109m 51s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestWriteReadStripedFile |
| Timed out tests | 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyWriter |
|   | org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure010 |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12764472/HDFS-9178.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 6c17d31 |
| Release Audit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/12753/artifact/patchprocess/patchReleaseAuditProblems.txt
 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-HDFS-Build/12753/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/12753/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/12753/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/12753/console |


This message was automatically generated.

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Attachments: HDFS-9178.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)