[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2015-10-21 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966393#comment-14966393
 ] 

Akira AJISAKA commented on HADOOP-11252:


Hi [~wilfreds], how is this issue going? This issue is critical for us, so I'd 
like to fix it early.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2015-09-14 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744795#comment-14744795
 ] 

Wilfred Spiegelenburg commented on HADOOP-11252:


sorry I have been occupied with a number of other things over the last period. 
I finally have some cycles and will look at this over the coming days.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2015-09-14 Thread Masatake Iwasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743352#comment-14743352
 ] 

Masatake Iwasaki commented on HADOOP-11252:
---

[~wilfreds], do you have any update on this? I tested equivalent patch in 
YARN-2578 and +1(non-binding) for the fix. I would like to update the patch 
based on [~andrew.wang]'s comment if you don't have time. Thanks.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2015-08-18 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701584#comment-14701584
 ] 

Akira AJISAKA commented on HADOOP-11252:


The patch mostly looks good to me. Hi [~wilfreds], would you reflect the review 
comments from Andrew? I'm +1 if those are addressed.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2015-08-14 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697384#comment-14697384
 ] 

Ray Chiang commented on HADOOP-11252:
-

Sorry for the delay.  I think I'm fine with Ajith's suggestion and keeping this 
patch as-is and filing a separate JIRA for decoupling ping and timeout.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2015-08-11 Thread Ajith S (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681346#comment-14681346
 ] 

Ajith S commented on HADOOP-11252:
--

Patch looks good, can we commit this patch and track decoupling ping and 
timeout in separate jira.?

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2015-07-27 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642380#comment-14642380
 ] 

Akira AJISAKA commented on HADOOP-11252:


Hi [~rchiang] and [~wilfreds], how is this issue going? We hit this issue and 
look forward to your updates.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2015-04-06 Thread Ray Chiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482010#comment-14482010
 ] 

Ray Chiang commented on HADOOP-11252:
-

+1 (non-binding).  The code changes look fine to me.  As an easy test of its 
effect, I'm able to cause problems on my cluster by setting the value to 1ms.  
Reasonable values run fine for me.

I'd like to see the new property properly documented.  From what I can see, 
core-default.xml looks like the right place.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2015-02-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325856#comment-14325856
 ] 

Hadoop QA commented on HADOOP-11252:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12684142/HADOOP-11252.patch
  against trunk revision 3f56a4c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5735//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5735//console

This message is automatically generated.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-12-03 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233209#comment-14233209
 ] 

Wilfred Spiegelenburg commented on HADOOP-11252:


[~andrew.wang] due to the way the rpc timeout in the client code overwrites the 
ping time out you are most likely correct. I'll have to step through the code 
in the client to make sure it behaves as intended. The ping is generated after 
a {{SocketTimeoutException}} is thrown on the input stream which is triggered 
by the {{setSoTimeout(pingInterval)}} on the socket, combined with the 
overwrite that could be a problem. This might require a further decoupling of 
the ping and rpc time out.

I also noticed that the ping output stream is created with a fixed timeout of 
0, that means we can still hang up there after the changes.

Looking at the HDFS code also to see how it is handled there and all references 
to the timeout that we are setting are called "socket write time out". I am 
happy to call it something else but this seems to be in line with HDFS also. 
The SO_SNDTIMEO only comes into play when the send buffers on the OS level on 
the local machine are full (as far as I am aware). If the buffer was not full 
when I wrote the data the time out will never trigger and I directly fall 
through to the tcp retries. That case should be handled by the time out we are 
setting.

The default change was a proposal and setting it to 0 is the right choice for 
backwards compatibility.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-12-01 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230601#comment-14230601
 ] 

Andrew Wang commented on HADOOP-11252:
--

Hi Wilfred, thanks for working on this.

I want to start by making sure I understand the patch correctly. We're changing 
the default rpc timeout to be 5min rather than 0. This means that, rather than 
sending a ping after a read blocks for 60s, we throw an exception after a read 
blocks for 5 mins. This actually does not involve write timeouts in the 
SO_SNDTIMEO sense, so it seems misleading to call it a "write timeout". If we 
get blocked on the socket write, we will still get stuck until the tcp stack 
bugs out (the tcp_retries2 you've mentioned elsewhere).

As [~daryn] points out above, and also on HDFS-4858 by [~atm], we've 
historically been reticent to change defaults like this because of potential 
side-effects. I'm not comfortable changing the defaults here either, without 
sign-off from e.g. [~daryn] who knows the RPC stuff better.

So, a few review comments:

* Let's rename the config param as Ming recommends above, seems more accurate. 
Including Ming's unit test would also be great.
* Let's keep the default value of this at 0 to preserve current behavior, 
unless [~daryn] ok's things.
* Since getPingInterval is now package-protected, we should also change 
setPingInterval to package-protected for parity. It's only used in a test.
* Need to add the new config key to core-default.xml also, with description.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-12-01 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229964#comment-14229964
 ] 

Ming Ma commented on HADOOP-11252:
--

Thanks, [~wilfreds]. Yes, it should cover all the cases discussed so far.

Regarding the name of the parameter, write.timeout seems to indicates the RPC 
call request doesn't make it to the server. As mentioned above, there are other 
scenarios where the RPC call request actually makes it to the server only that 
server doesn't respond. Should we use another name like 
"ipc.client.call.timeout.ms"?

For the unit test, we can verify the scenario where RPC call queue is full and 
thus RPC server doesn't respond in time. Here is the code snippet 
https://gist.github.com/mingmasplace/ce544ab21bbc4ff17564.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-11-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228242#comment-14228242
 ] 

Hadoop QA commented on HADOOP-11252:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12684142/HADOOP-11252.patch
  against trunk revision c1f2bb2.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5130//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5130//console

This message is automatically generated.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-11-27 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228120#comment-14228120
 ] 

Wilfred Spiegelenburg commented on HADOOP-11252:


A first version of a patch to set a value for the write timeout. I have used 
the proposed write timeout property name as given by [~cmccabe]. I have set it 
to an arbitrary default of 5 minutes, which seems reasonable. We still allow 
anyone to still set it to 0 (no timeout) if they want it.

The cases described by [~mingma] above YARN-2714, HDFS-4858 and YARN-2578 all 
have the same no response to a write cause. Setting this timeout should solve 
all these issues.
One point I think needs to be checked is the getTimeOut() in Client.java. It 
uses the ping interval as a timeout, if ping is not enabled. I think that this 
should be changed to use the same timeout as this change introduces. Also 
changing the time out based on the ping interval is not really logical. This 
jira might not be the correct one to introduce that change so I left it out.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Critical
> Attachments: HADOOP-11252.patch
>
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-11-11 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206707#comment-14206707
 ] 

Ming Ma commented on HADOOP-11252:
--

[~wilfreds], are you working on this? If not, I can provide the initial patch 
for discussion.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Priority: Critical
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-11-10 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205152#comment-14205152
 ] 

Ming Ma commented on HADOOP-11252:
--

Should we use another name other than {{ipc.client.write.timeout}} given it can 
cover scenarios besides RPC request write time out?

* HDFS-4858 covers the case where "The RPC server is unplugged before RPC call 
is delivered to the RPC server TCP stack". That is where write timeout applies.
* RPC request has been delivered to the RPC server, but client doesn't get any 
response. That could happen as in YARN-2714 where RPC server swallows 
OutOfMemoryError and just drops the response. Or the RPC request is still in 
RPC server call queue when RPC server is unplugged.

It seems like we want to define some end to end timeout, measure between the 
time when the RPC client writes the RPC call to client TCP stack and the time 
when RPC client reads the RPC response from client TCP stack.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Priority: Critical
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-11-03 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194781#comment-14194781
 ] 

Colin Patrick McCabe commented on HADOOP-11252:
---

bq. I agree that an optional write timeout is good, but I don't agree that 
ipc.ping.interval should be reused. It's for detecting broken connections or 
timing out when there are outstanding calls. There's an existing 
ipc.client.connect.timeout key so ipc.client.write.timeout would be a logical 
choice.

+1.  I like the idea of adding {{ipc.client.write.timeout}}.  Also agree that 
there may be some users who want to disable the timeout (i.e. set it to 0).

It would be nice if we could specify that this timeout is in milliseconds (i.e. 
name it {{ipc.client.write.timeout.ms}}).  The older timeouts are all 
unspecified, which is confusing since some of them are seconds, others 
milliseconds.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Priority: Critical
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-10-31 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192636#comment-14192636
 ] 

Daryn Sharp commented on HADOOP-11252:
--

I agree that an optional write timeout is good, but I don't agree that 
{{ipc.ping.interval}} should be reused.  It's for detecting broken connections 
or timing out when there are outstanding calls.  There's an existing 
{{ipc.client.connect.timeout}} key so {{ipc.client.write.timeout}} would be a 
logical choice.

I understand this change is for reducing failover latency with config-based HA. 
 But it adds fast-fail in cases where it's not desired.  Ex. If the NN is in 
GC, the last thing you want is for clients to repeatedly timeout & reconnect, 
overflow the listen queue, etc.

During a network cut of both NNs, clients may burn through their retries 
prematurely or exponentially fall back too far.  Or with IP-failover based HA, 
you _want_ the clients to wait.  When the standby assumes the IP, the 
connections break, and the clients reconnect.

Whether you set the write timeout is based on if you favor jobs succeeding at 
all reasonable costs, or you want fast-fail that many apps won't handle well.

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>Priority: Critical
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-10-31 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192331#comment-14192331
 ] 

Karthik Kambatla commented on HADOOP-11252:
---

Can we make sure this patch undoes the changes in HDFS-4858, as we will be 
fixing in common? Thanks. 

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default

2014-10-31 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191944#comment-14191944
 ] 

Ming Ma commented on HADOOP-11252:
--

Agree with Wilfred that we should fix it in hadoop-common. YARN-2714 is another 
example. While  HDFS-4858 has addressed the issue between DN and NN, we still 
have issue between client and NN.

1. Have a client machine make a RPC request to NN.
2. Power off NN while NN is processing the RPC.
3. The client machine can detect connection issue in around 16 minutes. 

> RPC client write does not time out by default
> -
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.5.0
>Reporter: Wilfred Spiegelenburg
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not timeout when used to 
> write data. The issue has shown in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the tcp level retry (configured via tcp_retries2) 
> and timeouts between the 15-30 minutes. Which is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)