[ https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037094#comment-17037094 ]
Stephan Ewen commented on FLINK-16030:
--------------------------------------

We had a discussion some time ago about "richer exception handling" on the Job Manager.

For example, when TM1 and TM2 are communicating and TM2 crashes, the first exception is often TM1 reporting a "loss of connection with TM2" from Netty. When recovery starts, the heartbeats have not yet timed out, so the JM tries to deploy to TM2 again. That deploy typically fails (ask timeout). Eventually the heartbeat times out and TM2 is removed. Only then does the redeploy succeed.

Recovery takes longer because we have to wait for TM2's heartbeat timeout before we understand that it is lost.

What we could do is make more use of exception information. For example, if TM1 reports a connection failure with TM2, we can use that to either cancel the corresponding task on TM2, or we can "graylist" TM2 until it reports proper running status again.

Just bringing this up, because these things seem to go in a similar direction.

> Add heartbeat between netty server and client to detect long connection alive
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-16030
>                 URL: https://issues.apache.org/jira/browse/FLINK-16030
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.7.2, 1.8.3, 1.9.2, 1.10.0
>            Reporter: begginghard
>            Assignee: begginghard
>            Priority: Major
>
> As reported on [the user mailing list|https://lists.apache.org/list.html?u...@flink.apache.org:lte=1M:Encountered%20error%20while%20consuming%20partitions]
> Network can fail in many ways, sometimes pretty subtle (e.g. high ratio
> packet loss).
> When the long-lived TCP connection between netty client and server is lost,
> the server fails to send its response to the client and then shuts down the
> channel. Meanwhile, the netty client does not know that the connection has
> been dropped, so it keeps waiting for two hours.
> To detect whether the long-lived TCP connection between netty client and
> server is still alive, there are two options: TCP keepalive and a heartbeat.
>
> The TCP keepalive time is 2 hours by default. When the long-lived TCP
> connection dies, the netty client keeps waiting for 2 hours before it
> triggers an exception and enters failover recovery.
> For faster detection, netty provides IdleStateHandler, which can be used
> with a ping-pong mechanism. If the netty client sends n consecutive ping
> messages and receives no pong message, it triggers an exception.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
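
The "n pings without a pong" rule from the issue description can be sketched in plain Java as below. This is only an illustration of the counting logic, not Flink or Netty code: the class `MissedPongDetector` and its methods `onIdleTick` and `onPong` are hypothetical names. It assumes some idle timer (in Netty, the idle events that IdleStateHandler fires could play this role) calls `onIdleTick` once per ping interval, and that the application itself defines the ping and pong messages.

```java
// Hypothetical sketch of the missed-pong counting rule; not part of Flink or Netty.
final class MissedPongDetector {
    private final int maxMissedPongs; // assumed threshold "n" from the issue description
    private int unanswered;           // pings sent since the last pong arrived

    MissedPongDetector(int maxMissedPongs) {
        if (maxMissedPongs < 1) {
            throw new IllegalArgumentException("maxMissedPongs must be >= 1");
        }
        this.maxMissedPongs = maxMissedPongs;
    }

    /**
     * Called once per ping interval. Returns true if n pings are already
     * unanswered (the caller should then fail the connection); otherwise
     * records that another ping is being sent and returns false.
     */
    synchronized boolean onIdleTick() {
        if (unanswered >= maxMissedPongs) {
            return true;
        }
        unanswered++; // the caller sends a ping now
        return false;
    }

    /** Called when any pong arrives; the connection is considered alive again. */
    synchronized void onPong() {
        unanswered = 0;
    }

    public static void main(String[] args) {
        MissedPongDetector detector = new MissedPongDetector(3);
        // No pong ever arrives: three pings are sent, the fourth tick reports failure.
        for (int tick = 1; tick <= 4; tick++) {
            System.out.println("tick " + tick + " dead=" + detector.onIdleTick());
        }
    }
}
```

With a ping interval of a few seconds and a small n, a dead connection would be noticed after roughly (n + 1) intervals instead of after the 2-hour keepalive default mentioned above. The actual interval and threshold are tuning choices this sketch does not prescribe.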