[ 
https://issues.apache.org/jira/browse/HADOOP-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192636#comment-14192636
 ] 

Daryn Sharp commented on HADOOP-11252:
--------------------------------------

I agree that an optional write timeout is good, but I don't agree that 
{{ipc.ping.interval}} should be reused.  It's for detecting broken connections 
or timing out when there are outstanding calls.  There's an existing 
{{ipc.client.connect.timeout}} key so {{ipc.client.write.timeout}} would be a 
logical choice.
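
To make the suggestion concrete, here is a minimal sketch of a write timeout that is read from its own key and never derived from {{ipc.ping.interval}}. It is not Hadoop code: a plain Map stands in for the Configuration class, and the key name, the default of 0 (meaning "no write timeout"), and the class and method names are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: a plain Map stands in for Hadoop's Configuration class. */
final class WriteTimeoutSketch {
    // Key name as proposed in this comment; treated as hypothetical here.
    static final String WRITE_TIMEOUT_KEY = "ipc.client.write.timeout";
    // Assumed default: 0 means "no write timeout", so the timeout stays optional.
    static final int WRITE_TIMEOUT_DEFAULT_MS = 0;

    /** Resolves the write timeout from its own key, ignoring ipc.ping.interval. */
    static int writeTimeoutMs(Map<String, String> conf) {
        String raw = conf.get(WRITE_TIMEOUT_KEY);
        if (raw == null) {
            return WRITE_TIMEOUT_DEFAULT_MS;
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException(
                WRITE_TIMEOUT_KEY + " must be >= 0, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("ipc.ping.interval", "60000"); // present, but not consulted
        System.out.println("default write timeout ms: " + writeTimeoutMs(conf));
        conf.put(WRITE_TIMEOUT_KEY, "30000");
        System.out.println("explicit write timeout ms: " + writeTimeoutMs(conf));
    }
}
```

Keeping the two keys separate lets an operator tune ping behaviour without silently enabling or changing write timeouts.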

I understand this change is meant to reduce failover latency with config-based 
HA. However, it also adds fast-fail in cases where it's not desired. For 
example, if the NN is in GC, the last thing you want is for clients to 
repeatedly time out and reconnect, overflow the listen queue, etc.

During a network cut of both NNs, clients may burn through their retries 
prematurely or back off exponentially for too long.  Or with IP-failover based HA, 
you _want_ the clients to wait.  When the standby assumes the IP, the 
connections break, and the clients reconnect.
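
A rough way to see the retry concern is to add up how long a client survives an unresponsive NN once writes time out. The sketch below is a simplified model, not Hadoop's retry policy code: it assumes every attempt blocks for the full write timeout and then sleeps before the next retry, and the class name, method name, and all numbers are hypothetical.

```java
/** Sketch only: a simplified model of a client retrying against an unresponsive server. */
final class RetryBudgetSketch {
    /**
     * Time until the client gives up, assuming each attempt blocks for the
     * full write timeout and a sleep separates consecutive attempts.
     * With exponential = true the sleep doubles after every retry.
     */
    static long msUntilGiveUp(int maxRetries, long writeTimeoutMs,
                              long baseSleepMs, boolean exponential) {
        long total = 0;
        long sleep = baseSleepMs;
        // One initial attempt plus maxRetries retries.
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            total += writeTimeoutMs;      // attempt fails when the write timeout fires
            if (attempt < maxRetries) {   // no sleep after the final attempt
                total += sleep;
                if (exponential) {
                    sleep *= 2;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical settings: 10 retries, 5 s write timeout, 1 s base sleep.
        System.out.println("fixed sleep ms: " + msUntilGiveUp(10, 5000, 1000, false));
        System.out.println("exponential sleep ms: " + msUntilGiveUp(10, 5000, 1000, true));
    }
}
```

With these made-up numbers, the fixed-sleep client gives up after about 65 seconds, which a long GC pause or network cut can easily exceed. The exponential client lasts longer, but its final sleep is 512 seconds, so it may sit idle for minutes after the NN is reachable again.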

Whether you set the write timeout depends on whether you favor jobs succeeding 
at all reasonable costs, or want fast-fail behavior that many apps won't handle 
well.

> RPC client write does not time out by default
> ---------------------------------------------
>
>                 Key: HADOOP-11252
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11252
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.5.0
>            Reporter: Wilfred Spiegelenburg
>            Priority: Critical
>
> The RPC client has a default timeout set to 0 when no timeout is passed in. 
> This means that the network connection created will not time out when used to 
> write data. The issue has shown up in YARN-2578 and HDFS-4858. Timeouts for 
> writes then fall back to the TCP-level retry (configured via tcp_retries2), 
> which times out after 15-30 minutes. That is too long for a default 
> behaviour.
> Using 0 as the default value for timeout is incorrect. We should use a sane 
> value for the timeout and the "ipc.ping.interval" configuration value is a 
> logical choice for it. The default behaviour should be changed from 0 to the 
> value read for the ping interval from the Configuration.
> Fixing it in common makes more sense than finding and changing all other 
> points in the code that do not pass in a timeout.
> Offending code lines:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L350



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
