begginghard created FLINK-16030:
-----------------------------------
Summary: Add heartbeat between netty server and client to detect
long connection alive
Key: FLINK-16030
URL: https://issues.apache.org/jira/browse/FLINK-16030
Project: Flink
Issue Type: Improvement
Components: Runtime / Network
Affects Versions: 1.10.0
Reporter: begginghard
Fix For: 1.10.1
Networks can fail in many ways, some of them subtle (e.g. a high packet-loss
ratio).
When the long-lived TCP connection between netty client and server is lost, the
server fails to send its response to the client and then shuts down the channel.
Meanwhile, the netty client does not know that the connection has been
dropped, so it keeps waiting for up to two hours.
To detect whether the long-lived TCP connection between netty client and server
is still alive, there are two options: TCP keepalive and a heartbeat.
TCP keepalive waits 2 hours by default before probing an idle connection. If
the long-lived TCP connection dies, the netty client therefore waits about
2 hours before it triggers an exception and enters failover recovery.
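
To make the 2-hour figure concrete, here is a minimal sketch in plain Java (not
Flink or netty code). It enables SO_KEEPALIVE on a socket and estimates the
worst-case detection time from the keepalive parameters. The class and method
names are hypothetical, and the values 7200 s / 75 s / 9 probes are an
assumption based on common Linux defaults; other systems may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

// Hypothetical helper for illustration only; not part of Flink or netty.
class KeepAliveSketch {

    // Worst-case seconds before the OS reports an idle, dead connection:
    // the idle time before the first probe, plus one interval per unanswered probe.
    static long detectionSeconds(long idleSeconds, long intervalSeconds, int probes) {
        return idleSeconds + intervalSeconds * probes;
    }

    // Enables SO_KEEPALIVE on a not-yet-connected socket and returns the option's value.
    static boolean enableKeepAlive() {
        try (Socket socket = new Socket()) {
            socket.setKeepAlive(true);
            return socket.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SO_KEEPALIVE enabled: " + enableKeepAlive());
        // Assumed Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
        System.out.println("Detection time (s): " + detectionSeconds(7200, 75, 9));
    }
}
```

Under these assumed defaults the estimate is 7200 + 9 * 75 = 7875 seconds, i.e.
slightly more than the 2 hours of idle time alone.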
For faster detection, netty provides IdleStateHandler, which can drive a
ping-pong mechanism: if the netty client sends n consecutive ping messages and
receives no pong message in return, it triggers an exception.
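
The counting rule described above could be sketched as follows. This is plain
Java with no netty dependency, so it only illustrates the bookkeeping. The
class HeartbeatSketch and its methods are hypothetical names; in a real netty
pipeline, onIdleTick() would presumably be called from a handler's
userEventTriggered() when IdleStateHandler fires an IdleStateEvent, and
onPongReceived() when a pong message is read. That wiring is not shown here.

```java
// Hypothetical sketch of the "n unanswered pings" rule; not the proposed Flink change.
class HeartbeatSketch {

    private final int maxUnansweredPings; // the "n" from the description
    private int unansweredPings;

    HeartbeatSketch(int maxUnansweredPings) {
        if (maxUnansweredPings < 1) {
            throw new IllegalArgumentException("maxUnansweredPings must be >= 1");
        }
        this.maxUnansweredPings = maxUnansweredPings;
    }

    // Called whenever the connection has been idle for one heartbeat interval.
    // Fails if n pings are already outstanding; otherwise counts one more ping
    // (the caller is expected to actually send it).
    void onIdleTick() {
        if (unansweredPings >= maxUnansweredPings) {
            throw new IllegalStateException(
                    "No pong after " + unansweredPings + " consecutive pings");
        }
        unansweredPings++;
    }

    // Called when a pong arrives: the peer is alive, so the counter starts over.
    void onPongReceived() {
        unansweredPings = 0;
    }

    int unansweredPings() {
        return unansweredPings;
    }

    public static void main(String[] args) {
        HeartbeatSketch heartbeat = new HeartbeatSketch(3);
        heartbeat.onIdleTick();     // ping 1
        heartbeat.onPongReceived(); // answered, counter resets
        heartbeat.onIdleTick();     // ping 1
        heartbeat.onIdleTick();     // ping 2
        heartbeat.onIdleTick();     // ping 3
        try {
            heartbeat.onIdleTick(); // 3 pings unanswered, so this fails
        } catch (IllegalStateException e) {
            System.out.println("Connection considered dead: " + e.getMessage());
        }
    }
}
```

With a heartbeat interval of t seconds, this rule detects a dead connection
after roughly (n + 1) * t seconds of silence instead of hours.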
--
This message was sent by Atlassian Jira
(v8.3.4#803005)