[ https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhijiang updated FLINK-16030:
-----------------------------
    Fix Version/s:     (was: 1.10.1)

> Add heartbeat between netty server and client to detect long connection alive
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-16030
>                 URL: https://issues.apache.org/jira/browse/FLINK-16030
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.10.0
>            Reporter: begginghard
>            Priority: Major
>
> Networks can fail in many ways, some of them subtle (e.g. a high packet
> loss ratio).
> When the long-lived TCP connection between the Netty client and server is
> lost, the server fails to send its response to the client and then shuts
> down the channel. Meanwhile, the Netty client does not know that the
> connection has been broken, so it keeps waiting for two hours.
> There are two ways to detect whether the long-lived TCP connection between
> the Netty client and server is still alive: TCP keepalive and heartbeat.
>
> The TCP keepalive time is 2 hours by default. When the long-lived TCP
> connection dies, the Netty client triggers an exception and enters failover
> recovery only after waiting for those 2 hours.
> For faster detection, Netty provides IdleStateHandler, which can be used to
> implement a ping-pong mechanism: if the Netty client sends n consecutive
> ping messages and receives no pong message in reply, it triggers an
> exception.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
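
The "n pings without a pong" rule described in the issue can be sketched in plain Java. This is only an illustration under stated assumptions, not the change proposed for Flink: the class name HeartbeatMonitor, its methods, and the threshold parameter are hypothetical, and the sketch deliberately has no Netty dependency so that it can be run on its own.

```java
/**
 * Sketch of the "n unanswered pings" rule from the issue description.
 * HeartbeatMonitor and its methods are hypothetical names used for
 * illustration; they are not part of Netty or Flink.
 */
final class HeartbeatMonitor {
    private final int maxUnansweredPings;
    private int unansweredPings;

    HeartbeatMonitor(int maxUnansweredPings) {
        if (maxUnansweredPings < 1) {
            throw new IllegalArgumentException("maxUnansweredPings must be >= 1");
        }
        this.maxUnansweredPings = maxUnansweredPings;
    }

    /**
     * Call each time the connection has been idle for one heartbeat interval.
     * Returns true if the caller should send another ping, or false if
     * maxUnansweredPings pings have already gone unanswered and the
     * connection should be treated as dead.
     */
    boolean onIdle() {
        if (unansweredPings >= maxUnansweredPings) {
            return false;
        }
        unansweredPings++;
        return true;
    }

    /** Call when a pong arrives; the peer is alive, so reset the counter. */
    void onPong() {
        unansweredPings = 0;
    }

    int getUnansweredPings() {
        return unansweredPings;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(3);
        // No pong ever arrives in this demo, so the fourth idle tick gives up.
        for (int tick = 1; tick <= 4; tick++) {
            boolean sendPing = monitor.onIdle();
            System.out.println("idle tick " + tick + ": "
                    + (sendPing ? "send ping" : "connection considered dead"));
        }
    }
}
```

In a Netty pipeline, one plausible wiring would be to place an IdleStateHandler ahead of a custom handler, call onIdle() when an IdleStateEvent reaches userEventTriggered, and call onPong() when a pong message is read. When onIdle() returns false, the handler could close the channel and surface an exception so that failover starts without waiting for TCP keepalive.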