Thanks for all the help. Following the advice, I have fixed the problem.

> On Feb 13, 2020, at 6:05 PM, Zhijiang <wangzhijiang...@aliyun.com> wrote:
> 
> Thanks for reporting this issue. I agree with the analysis below. 
> We actually encountered the same issue several years ago and also solved it 
> via the netty idle handler.
> 
> As the next step, let's track it in ticket [1].
> 
> [1] https://issues.apache.org/jira/browse/FLINK-16030
> 
> Best,
> Zhijiang
> 
> ------------------------------------------------------------------
> From:张光辉 <beggingh...@gmail.com>
> Send Time:2020 Feb. 12 (Wed.) 22:19
> To:Benchao Li <libenc...@gmail.com>
> Cc:刘建刚 <liujiangangp...@gmail.com>; user <user@flink.apache.org>
> Subject:Re: Encountered error while consuming partitions
> 
> Networks can fail in many ways, some of them quite subtle (e.g. a high 
> packet-loss ratio). 
> 
> The problem is that the long-lived TCP connection between the netty client 
> and server was lost. The server then failed to send a message to the client 
> and shut down the channel. The netty client does not know that the 
> connection has been dropped, so it keeps waiting. 
> 
> To detect whether a long-lived TCP connection between the netty client and 
> server is still alive, there are two options: TCP keepalive and heartbeats.
> The TCP keepalive idle time is 2 hours by default. When the error occurs, if 
> you wait for 2 hours, the netty client will raise an exception and enter 
> failover recovery.
> If you want to detect a dead connection quickly, netty provides 
> IdleStateHandler, which can be combined with a ping-pong mechanism. If the 
> netty client sends n consecutive ping messages and receives no pong message 
> in return, it triggers an exception.
> 
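For anyone reading this thread later, the "n unanswered pings" rule described in the quoted message can be sketched roughly as below. This is only an illustration under my own assumptions, not the actual fix in FLINK-16030 and not Netty's API: the class name HeartbeatMonitor and its methods tick() and onPong() are hypothetical, and the code uses only the JDK so it runs without Netty on the classpath. In a real pipeline, tick() would be driven by whatever fires once per heartbeat interval (for example an idle event), and onPong() by the handler that reads pong messages.

```java
/**
 * Hypothetical helper (not part of Netty or Flink): tracks unanswered pings
 * and reports the peer as dead once n pings in a row got no pong.
 */
final class HeartbeatMonitor {
    private final int maxUnansweredPings; // the "n" from the description above
    private int unansweredPings;          // pings sent since the last pong

    HeartbeatMonitor(int maxUnansweredPings) {
        if (maxUnansweredPings < 1) {
            throw new IllegalArgumentException("maxUnansweredPings must be >= 1");
        }
        this.maxUnansweredPings = maxUnansweredPings;
    }

    /**
     * Call once per heartbeat interval.
     * Returns true if the caller should send another ping, or false if
     * n pings are already unanswered and the connection should be failed.
     */
    synchronized boolean tick() {
        if (unansweredPings >= maxUnansweredPings) {
            return false; // peer considered dead: close the channel / raise an error
        }
        unansweredPings++;
        return true;
    }

    /** Call whenever a pong arrives; any pong resets the counter. */
    synchronized void onPong() {
        unansweredPings = 0;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(3);
        // Simulate a peer that never answers: keep ticking until declared dead.
        int pingsSent = 0;
        while (monitor.tick()) {
            pingsSent++;
        }
        System.out.println("peer declared dead after " + pingsSent + " unanswered pings");
    }
}
```

The counter is checked before each new ping rather than right after sending one, so the last ping still gets a full interval in which its pong may arrive.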
