[ 
https://issues.apache.org/jira/browse/IMPALA-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876999#comment-16876999
 ] 

ASF subversion and git services commented on IMPALA-6159:
---------------------------------------------------------

Commit 7206b52e5b6ab1df045e4249d859129d32aacf6b in impala's branch 
refs/heads/master from Michael Ho
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=7206b52 ]

IMPALA-6159 / KUDU-2192: Enable TCP keepalive for all outbound connections

This change enables TCP keepalive for all outbound connections.
It aims to handle cases in which the remote peer has dropped off
the network without sending a TCP RST. For instance, a remote host
could have hit a kernel panic and been power cycled, in which case
the existing TCP connection to that host is stale. In an idle
cluster, this stale connection may not be detected until its next
use, at which point it results in an RPC failure due to the TCP RST
sent from the restarted peer.

By enabling TCP keepalive, we ensure that stale TCP connections
in an idle cluster will be detected and closed within a time bound
so that a new connection will be created on the next use. This change
introduces three flags:

--tcp_keepalive_probe_period_s: the duration in seconds a TCP connection
has to be idle before keepalive probes start being sent.

--tcp_keepalive_retry_period_s: the duration in seconds between successive
keepalive probes if previous probes didn't get an ACK from the remote peer.

--tcp_keepalive_retry_count: the maximum number of TCP keepalive probes
sent without an ACK before declaring the remote peer as dead.
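
As a rough illustration of what the three flags control, here is a minimal
Python sketch, not the actual Impala/Kudu code. It assumes the flags
correspond to the Linux socket options TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT. The function names and the values 60/3/10 are hypothetical and
are not taken from the commit.

```python
import socket


def enable_tcp_keepalive(sock, probe_period_s, retry_period_s, retry_count):
    """Enable keepalive on one socket using the Linux option names.

    Assumption: the three arguments play the roles of the three flags
    described above; this is not the Impala/Kudu implementation.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Idle time before the first probe (cf. --tcp_keepalive_probe_period_s).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, probe_period_s)
    # Gap between unacknowledged probes (cf. --tcp_keepalive_retry_period_s).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, retry_period_s)
    # Unacknowledged probes before giving up (cf. --tcp_keepalive_retry_count).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, retry_count)


def approx_detection_bound_s(probe_period_s, retry_period_s, retry_count):
    """Approximate time for an idle connection to a dead peer to be closed."""
    return probe_period_s + retry_period_s * retry_count


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_tcp_keepalive(s, 60, 3, 10)  # hypothetical values
    print(approx_detection_bound_s(60, 3, 10))
    s.close()
```

Under these assumptions, the time bound mentioned above is roughly the idle
period plus the retry period multiplied by the retry count.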

Testing:
- Used tcpdump to verify that keepalive probes are being sent periodically.
- Verified that blocking all incoming traffic to a server's port via an iptables
rule caused the TCP connection to be closed and the keepalive probes to
eventually stop.

Change-Id: Iaa1d66d83aea1cc82d07fc6217be5fc1306695bc
Reviewed-on: http://gerrit.cloudera.org:8080/13702
Reviewed-by: Alexey Serbin <aser...@cloudera.com>
Reviewed-by: Todd Lipcon <t...@apache.org>
Tested-by: Kudu Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/13764
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> DataStreamSender should transparently handle some connection reset by peer
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-6159
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6159
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Distributed Exec
>    Affects Versions: Impala 2.12.0
>            Reporter: Michael Ho
>            Assignee: Michael Ho
>            Priority: Critical
>
> A client to server KRPC connection can become stale if the socket was closed 
> on the server side due to various reasons such as idle connection removal or 
> remote Impalad restart. Currently, the KRPC code will invoke the callback of 
> all RPCs using that stale connection with the failed status (e.g. "Connection 
> reset by peer"). DataStreamSender should pattern match against certain error 
> strings (as they are mostly output from strerror()) and retry the RPC 
> transparently. This may also be useful for KUDU-2192, which tracks the 
> effort to detect stuck connections and close them, in which case we may also 
> want to transparently retry the RPC.
> FWIW, KUDU-279 is tracking the effort to have a cleaner protocol for 
> connection teardown due to idle client connection removal on the server side. 
> However, Impala still needs to handle other reasons for a stale connection.
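
The retry idea in the description above could be sketched roughly as follows.
This is a hypothetical Python illustration, not DataStreamSender or KRPC code:
the helper names, the (ok, message) return shape, and the choice of ECONNRESET
as the only retryable error are all assumptions made for the example.

```python
import errno
import os

# Assumption: only "Connection reset by peer" is treated as a stale
# connection here. The text is taken from strerror() so it matches the
# platform's wording.
RETRYABLE_ERROR_STRINGS = (os.strerror(errno.ECONNRESET),)


def is_stale_connection_error(message):
    """Return True if the status message contains a retryable error string."""
    return any(text in message for text in RETRYABLE_ERROR_STRINGS)


def call_with_retry(rpc, max_retries=1):
    """Invoke rpc(), retrying transparently on stale-connection errors.

    rpc is a hypothetical callable returning (ok, message).
    """
    attempt = 0
    while True:
        ok, message = rpc()
        if ok or attempt >= max_retries:
            return ok, message
        if not is_stale_connection_error(message):
            return ok, message
        attempt += 1
```

Any other failure is returned to the caller unchanged, so only errors that
match the list are retried.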



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

