[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966393#comment-14966393 ]

Akira AJISAKA commented on HADOOP-11252:
----------------------------------------

Hi [~wilfreds], how is this issue going? This issue is critical for us, so I'd like to get it fixed soon.

> RPC client write does not time out by default
> ---------------------------------------------
>
>                 Key: HADOOP-11252
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11252
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.5.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>         Attachments: HADOOP-11252.patch
>
> The RPC client has a default timeout of 0 when no timeout is passed in. This
> means that the network connection created will not time out when used to
> write data. The issue has shown up in YARN-2578 and HDFS-4858. Timeouts for
> writes then fall back to the TCP-level retry (configured via tcp_retries2)
> and fire after 15-30 minutes, which is too long for a default behaviour.
> Using 0 as the default value for the timeout is incorrect. We should use a
> sane value for the timeout, and the "ipc.ping.interval" configuration value
> is a logical choice for it. The default behaviour should be changed from 0
> to the value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all the other
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
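The fallback the issue description proposes, falling back to the ping interval rather than 0 when no RPC timeout is passed in, can be sketched as follows. This is a minimal illustration; the method and constant names are hypothetical, not Hadoop's actual API.

```java
// Sketch of the proposed default: when the caller passes no timeout,
// use the configured ping interval instead of 0 ("never time out").
public class TimeoutDefaultSketch {
    static final int DEFAULT_PING_INTERVAL_MS = 60_000; // ipc.ping.interval default

    // Returns the timeout to apply to the connection.
    static int effectiveTimeout(int requestedTimeoutMs, int pingIntervalMs) {
        if (requestedTimeoutMs > 0) {
            return requestedTimeoutMs; // caller supplied an explicit timeout
        }
        return pingIntervalMs;         // proposed default; today this is 0
    }

    public static void main(String[] args) {
        System.out.println(effectiveTimeout(0, DEFAULT_PING_INTERVAL_MS));      // 60000
        System.out.println(effectiveTimeout(15_000, DEFAULT_PING_INTERVAL_MS)); // 15000
    }
}
```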
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744795#comment-14744795 ]

Wilfred Spiegelenburg commented on HADOOP-11252:
------------------------------------------------

Sorry, I have been occupied with a number of other things recently. I finally have some cycles and will look at this over the coming days.
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743352#comment-14743352 ]

Masatake Iwasaki commented on HADOOP-11252:
-------------------------------------------

[~wilfreds], do you have any update on this? I tested the equivalent patch in YARN-2578 and am +1 (non-binding) for the fix. I would like to update the patch based on [~andrew.wang]'s comments if you don't have time. Thanks.
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701584#comment-14701584 ]

Akira AJISAKA commented on HADOOP-11252:
----------------------------------------

The patch mostly looks good to me. Hi [~wilfreds], would you address the review comments from Andrew? I'm +1 once those are addressed.
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697384#comment-14697384 ]

Ray Chiang commented on HADOOP-11252:
-------------------------------------

Sorry for the delay. I'm fine with Ajith's suggestion: keep this patch as-is and file a separate JIRA for decoupling the ping and the timeout.
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681346#comment-14681346 ]

Ajith S commented on HADOOP-11252:
----------------------------------

The patch looks good. Can we commit it and track decoupling the ping and the timeout in a separate JIRA?
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642380#comment-14642380 ]

Akira AJISAKA commented on HADOOP-11252:
----------------------------------------

Hi [~rchiang] and [~wilfreds], how is this issue going? We hit this issue and are looking forward to your updates.
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482010#comment-14482010 ]

Ray Chiang commented on HADOOP-11252:
-------------------------------------

+1 (non-binding). The code changes look fine to me. As an easy test of the effect, I'm able to cause problems on my cluster by setting the value to 1 ms; reasonable values run fine for me.

I'd like to see the new property properly documented. From what I can see, core-default.xml looks like the right place.
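A core-default.xml entry along these lines would cover the documentation point above. The property name and the 5-minute default here are placeholders taken from this discussion; neither was final at the time of the comment.

```xml
<property>
  <name>ipc.client.write.timeout.ms</name>
  <value>300000</value>
  <description>
    Timeout in milliseconds for RPC client writes. A value of 0 disables
    the timeout and restores the old behaviour of relying on TCP-level
    retries (tcp_retries2).
  </description>
</property>
```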
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325856#comment-14325856 ]

Hadoop QA commented on HADOOP-11252:
------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12684142/HADOOP-11252.patch
against trunk revision 3f56a4c.

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5735//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5735//console

This message is automatically generated.
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233209#comment-14233209 ]

Wilfred Spiegelenburg commented on HADOOP-11252:
------------------------------------------------

[~andrew.wang] due to the way the RPC timeout in the client code overrides the ping timeout, you are most likely correct. I'll have to step through the client code to make sure it behaves as intended. The ping is generated after a {{SocketTimeoutException}} is thrown on the input stream, which is triggered by the {{setSoTimeout(pingInterval)}} on the socket; combined with the override, that could be a problem. This might require a further decoupling of the ping and RPC timeouts. I also noticed that the ping output stream is created with a fixed timeout of 0, which means we can still hang there even after the changes.

Looking at the HDFS code to see how it is handled there, all references to the timeout that we are setting call it a "socket write timeout". I am happy to call it something else, but this naming seems to be in line with HDFS as well.

The SO_SNDTIMEO only comes into play when the send buffers at the OS level on the local machine are full (as far as I am aware). If the buffer was not full when I wrote the data, that timeout will never trigger and we fall straight through to the TCP retries. That case should be handled by the timeout we are setting.

The default change was a proposal; keeping it at 0 is the right choice for backwards compatibility.
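The ping mechanism described above can be sketched with plain sockets: the read blocks for at most the socket's soTimeout, and each {{SocketTimeoutException}} marks the point where the real IPC client sends a ping instead of failing. This is a self-contained illustration, not Hadoop's actual PingInputStream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class PingLoopSketch {
    // Reads from the stream; each SocketTimeoutException is where the real
    // client would send a ping. Stops after maxPings timeouts for the demo.
    static int readWithPings(InputStream in, int maxPings) throws IOException {
        int pings = 0;
        while (pings < maxPings) {
            try {
                in.read();             // blocks for at most soTimeout
            } catch (SocketTimeoutException e) {
                pings++;               // real IPC client sends a ping here
            }
        }
        return pings;
    }

    // Loopback server that accepts but never writes, so every read times out.
    static int demoPings() throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket silent = server.accept()) {
            client.setSoTimeout(50);   // stands in for ipc.ping.interval
            return readWithPings(client.getInputStream(), 3);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pings sent: " + demoPings()); // pings sent: 3
    }
}
```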
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230601#comment-14230601 ]

Andrew Wang commented on HADOOP-11252:
--------------------------------------

Hi Wilfred, thanks for working on this. I want to start by making sure I understand the patch correctly. We're changing the default RPC timeout to 5 minutes rather than 0. This means that, rather than sending a ping after a read blocks for 60s, we throw an exception after a read blocks for 5 minutes. This actually does not involve write timeouts in the SO_SNDTIMEO sense, so it seems misleading to call it a "write timeout". If we get blocked on the socket write, we will still get stuck until the TCP stack bugs out (the tcp_retries2 you've mentioned elsewhere).

As [~daryn] points out above, and as [~atm] did on HDFS-4858, we've historically been reticent to change defaults like this because of potential side effects. I'm not comfortable changing the defaults here either without sign-off from e.g. [~daryn], who knows the RPC stuff better.

So, a few review comments:
* Let's rename the config param as Ming recommends above; it seems more accurate. Including Ming's unit test would also be great.
* Let's keep the default value of this at 0 to preserve current behaviour, unless [~daryn] OKs the change.
* Since getPingInterval is now package-protected, we should also change setPingInterval to package-protected for parity. It's only used in a test.
* We need to add the new config key to core-default.xml as well, with a description.
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229964#comment-14229964 ]

Ming Ma commented on HADOOP-11252:
----------------------------------

Thanks, [~wilfreds]. Yes, it should cover all the cases discussed so far. Regarding the name of the parameter: "write.timeout" seems to indicate that the RPC call request doesn't make it to the server. As mentioned above, there are other scenarios where the RPC call request actually makes it to the server, but the server doesn't respond. Should we use another name like "ipc.client.call.timeout.ms"? For the unit test, we can verify the scenario where the RPC call queue is full and thus the RPC server doesn't respond in time. Here is the code snippet: https://gist.github.com/mingmasplace/ce544ab21bbc4ff17564.
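The kind of client-side call timeout suggested above, bounding a call whose request reached the server but that gets no response, can be illustrated with a Future-based sketch. This is illustrative only; Hadoop's IPC client does not use this mechanism, and the method names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CallTimeoutSketch {
    // Runs a blocking call with a deadline; returns true if the deadline fired.
    static boolean timesOut(Callable<Void> call, long timeoutMs) throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            ex.submit(call).get(timeoutMs, TimeUnit.MILLISECONDS);
            return false;              // server answered in time
        } catch (TimeoutException e) {
            return true;               // unresponsive server: call is abandoned
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A stand-in for a server whose call queue is full: never responds.
        boolean hung = timesOut(() -> { Thread.sleep(10_000); return null; }, 100);
        System.out.println("timed out: " + hung); // timed out: true
    }
}
```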
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228242#comment-14228242 ]

Hadoop QA commented on HADOOP-11252:
------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12684142/HADOOP-11252.patch
against trunk revision c1f2bb2.

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/5130//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/5130//console

This message is automatically generated.
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228120#comment-14228120 ] Wilfred Spiegelenburg commented on HADOOP-11252:

A first version of a patch that sets a value for the write timeout. I have used the write timeout property name proposed by [~cmccabe] and set it to an arbitrary default of 5 minutes, which seems reasonable. Anyone who wants no timeout can still set the value to 0.

The cases [~mingma] described above (YARN-2714, HDFS-4858 and YARN-2578) all share the same cause: no response to a write. Setting this timeout should resolve all of them.

One point that I think needs to be checked is getTimeOut() in Client.java: it uses the ping interval as a timeout when ping is not enabled. I think it should be changed to use the same timeout this change introduces; deriving the timeout from the ping interval is not really logical. This jira might not be the correct one to introduce that change, so I left it out.

> RPC client write does not time out by default
> ---------------------------------------------
>
> Key: HADOOP-11252
> URL: https://issues.apache.org/jira/browse/HADOOP-11252
> Project: Hadoop Common
> Issue Type: Bug
> Components: ipc
> Affects Versions: 2.5.0
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Critical
> Attachments: HADOOP-11252.patch
>
> The RPC client has a default timeout of 0 when no timeout is passed in. This means that the network connection created will not time out when used to write data. The issue has shown up in YARN-2578 and HDFS-4858. Write timeouts then fall back to TCP-level retries (configured via tcp_retries2) and time out after 15-30 minutes, which is too long for a default behaviour.
> Using 0 as the default timeout is incorrect. We should use a sane value for the timeout, and the "ipc.ping.interval" configuration value is a logical choice for it. The default behaviour should be changed from 0 to the ping interval value read from the Configuration.
> Fixing it in common makes more sense than finding and changing every other point in the code that does not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
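The semantics proposed in the patch above — a configurable write timeout with a default, where 0 disables it — can be sketched as follows. This is only an illustration of the discussed behaviour, not the actual Java patch; the key name and the 5-minute default come from the discussion, the helper itself is hypothetical.

```python
# Sketch of the proposed semantics: read an RPC write timeout from
# configuration, fall back to an arbitrary 5-minute default, and keep
# 0 meaning "no write timeout" for users who want the old behaviour.

DEFAULT_WRITE_TIMEOUT_MS = 5 * 60 * 1000  # arbitrary 5-minute default

def get_write_timeout_ms(conf):
    """Return the effective write timeout in ms, or None to disable it."""
    value = int(conf.get("ipc.client.write.timeout", DEFAULT_WRITE_TIMEOUT_MS))
    if value == 0:
        return None  # 0 preserves the old behaviour: no write timeout
    return value
```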
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206707#comment-14206707 ] Ming Ma commented on HADOOP-11252:

[~wilfreds], are you working on this? If not, I can provide an initial patch for discussion.
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205152#comment-14205152 ] Ming Ma commented on HADOOP-11252:

Should we use a name other than {{ipc.client.write.timeout}}, given that it can cover scenarios beyond the RPC request write timeout?

* HDFS-4858 covers the case where the RPC server is unplugged before the RPC call is delivered to the RPC server's TCP stack. That is where a write timeout applies.
* The RPC request has been delivered to the RPC server, but the client never gets a response. That can happen as in YARN-2714, where the RPC server swallows an OutOfMemoryError and simply drops the response, or when the RPC request is still in the server's call queue at the moment the server is unplugged.

It seems we want to define an end-to-end timeout, measured between the time the RPC client writes the RPC call to the client TCP stack and the time it reads the RPC response from the client TCP stack.
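The end-to-end timeout idea above — one budget that covers both writing the request and reading the response, instead of separate per-operation timeouts — could be sketched like this. The class and its use are illustrative only; nothing here is from a real patch.

```python
import time

class RpcDeadline:
    """Sketch of an end-to-end RPC timeout: a single deadline that bounds
    both the write of the request and the read of the response, rather
    than resetting the clock for each blocking operation."""

    def __init__(self, timeout_s):
        self.expires_at = time.monotonic() + timeout_s

    def remaining(self):
        """Seconds left in the budget; raises once the deadline has passed."""
        left = self.expires_at - time.monotonic()
        if left <= 0:
            raise TimeoutError("end-to-end RPC deadline exceeded")
        return left

# Usage sketch: every blocking step draws from what remains of ONE budget:
#   sock.settimeout(deadline.remaining()); sock.sendall(request)
#   sock.settimeout(deadline.remaining()); response = sock.recv(4096)
```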
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194781#comment-14194781 ] Colin Patrick McCabe commented on HADOOP-11252:

bq. I agree that an optional write timeout is good, but I don't agree that ipc.ping.interval should be reused. It's for detecting broken connections or timing out when there are outstanding calls. There's an existing ipc.client.connect.timeout key, so ipc.client.write.timeout would be a logical choice.

+1. I like the idea of adding {{ipc.client.write.timeout}}. I also agree that some users may want to disable the timeout (i.e. set it to 0). It would be nice to specify that this timeout is in milliseconds (i.e. name it {{ipc.client.write.timeout.ms}}). The older timeout keys all leave the unit unspecified, which is confusing since some of them are in seconds and others in milliseconds.
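The unit-ambiguity point above — unsuffixed keys force readers to guess whether a value means seconds or milliseconds — could be handled by encoding the unit in the key name, roughly as sketched here. Both key names and the helper are hypothetical, for illustration only.

```python
def read_timeout_ms(conf, base_key, default_ms):
    """Prefer a key whose '.ms' suffix makes the unit self-describing.

    Older unsuffixed timeout keys carry no unit in their name, so this
    sketch falls back to the bare key only when no '.ms' key is set.
    """
    suffixed = base_key + ".ms"
    if suffixed in conf:
        return int(conf[suffixed])
    return int(conf.get(base_key, default_ms))
```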
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192636#comment-14192636 ] Daryn Sharp commented on HADOOP-11252:

I agree that an optional write timeout is good, but I don't agree that {{ipc.ping.interval}} should be reused. It's for detecting broken connections or timing out when there are outstanding calls. There's an existing {{ipc.client.connect.timeout}} key, so {{ipc.client.write.timeout}} would be a logical choice.

I understand this change is for reducing failover latency with config-based HA, but it adds fast-fail in cases where it's not desired. For example, if the NN is in GC, the last thing you want is clients repeatedly timing out and reconnecting, overflowing the listen queue, etc. During a network cut of both NNs, clients may burn through their retries prematurely or exponentially back off too far. And with IP-failover based HA, you _want_ the clients to wait: when the standby assumes the IP, the connections break and the clients reconnect. Whether you set the write timeout depends on whether you favor jobs succeeding at all reasonable cost, or you want a fast-fail behaviour that many apps won't handle well.
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192331#comment-14192331 ] Karthik Kambatla commented on HADOOP-11252:

Can we make sure this patch undoes the changes from HDFS-4858, since we will be fixing this in common? Thanks.
[jira] [Commented] (HADOOP-11252) RPC client write does not time out by default
[ https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191944#comment-14191944 ] Ming Ma commented on HADOOP-11252:

Agree with Wilfred that we should fix this in hadoop-common; YARN-2714 is another example. While HDFS-4858 addressed the issue between the DN and NN, we still have the issue between a client and the NN:
1. Have a client machine make an RPC request to the NN.
2. Power off the NN while it is processing the RPC.
3. The client machine only detects the connection issue after around 16 minutes.
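The roughly 16-minute detection time above is consistent with Linux TCP retransmission behaviour: with the default tcp_retries2 = 15, the retransmission timeout (RTO) doubles on each retry up to a cap before the connection is aborted. A back-of-the-envelope computation, assuming a 200 ms initial RTO and the conventional 120 s RTO cap (actual values depend on the measured RTT):

```python
def tcp_abort_delay_s(retries=15, initial_rto_s=0.2, rto_max_s=120.0):
    """Approximate seconds before Linux TCP gives up on unacked data.

    Each retransmission waits min(initial_rto * 2**k, rto_max); the
    connection is aborted when the timer following the last allowed
    retransmission expires, i.e. after retries + 1 timeout periods.
    """
    return sum(min(initial_rto_s * 2 ** k, rto_max_s) for k in range(retries + 1))

print(round(tcp_abort_delay_s() / 60, 1))  # prints 15.4
```

With a larger initial RTO on a lossy or high-latency path, the same formula pushes the abort time toward the 30-minute end of the range quoted in the issue description.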