Thanks for reporting this issue. I agree with the analysis below. We actually encountered the same issue several years ago and also solved it with the Netty idle handler.
Let's track it in ticket [1] as the next step.

[1] https://issues.apache.org/jira/browse/FLINK-16030

Best,
Zhijiang

------------------------------------------------------------------
From:张光辉 <beggingh...@gmail.com>
Send Time:2020 Feb. 12 (Wed.) 22:19
To:Benchao Li <libenc...@gmail.com>
Cc:刘建刚 <liujiangangp...@gmail.com>; user <user@flink.apache.org>
Subject:Re: Encountered error while consuming partitions

Networks can fail in many ways, some of them quite subtle (e.g. a high packet-loss ratio).

The problem is that the long-lived TCP connection between the Netty client and server is lost. The server then fails to send a message to the client and shuts down the channel. The Netty client does not know that the connection has been dropped, so it keeps waiting.

There are two ways to detect whether a long-lived TCP connection between the Netty client and server is still alive: TCP keepalive and heartbeats. The TCP keepalive time is 2 hours by default, so when the error occurs, if you wait for 2 hours, the Netty client will throw an exception and enter failover recovery.

To detect a dead connection more quickly, Netty provides IdleStateHandler, which can be used to implement a ping-pong mechanism: if the Netty client sends n consecutive ping messages and receives no pong in reply, it triggers an exception.
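
To make the ping-pong rule concrete, here is a minimal sketch of just the counting logic in plain Java. It is not Netty's IdleStateHandler and not the fix from the ticket; the class HeartbeatMonitor and its methods onIdleTick and onPongReceived are hypothetical names chosen for illustration. The assumption is that something else (for example an idle event from the network layer) calls onIdleTick at a fixed interval and that the caller sends a ping whenever it returns false.

```java
/**
 * Hypothetical helper (not part of Netty or Flink): counts pings that
 * have gone unanswered and reports when the peer should be treated as dead.
 */
final class HeartbeatMonitor {
    private final int maxUnansweredPings;
    private int unansweredPings = 0;

    HeartbeatMonitor(int maxUnansweredPings) {
        if (maxUnansweredPings < 1) {
            throw new IllegalArgumentException("maxUnansweredPings must be >= 1");
        }
        this.maxUnansweredPings = maxUnansweredPings;
    }

    /**
     * Called once per idle interval (assumed to be driven by a timer or idle event).
     * Returns true if n pings were already sent without any pong, meaning the
     * caller should fail the connection. Otherwise records one more ping and
     * returns false, meaning the caller should send that ping now.
     */
    synchronized boolean onIdleTick() {
        if (unansweredPings >= maxUnansweredPings) {
            return true;
        }
        unansweredPings++;
        return false;
    }

    /** Called when a pong arrives; any pong proves the connection is alive. */
    synchronized void onPongReceived() {
        unansweredPings = 0;
    }
}
```

With a limit of 3, the first three idle ticks each send a ping, and a fourth tick with no pong in between reports the connection as dead. The detection time is therefore roughly (n + 1) idle intervals, which can be tuned to seconds rather than the 2 hours of the default TCP keepalive.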