NativeIoException PartitionRequestQueue - Encountered error while consuming partitions

2022-10-11 Thread Clayton Wohl
I have a streaming Flink job that runs 24/7 on a Kubernetes cluster hosted
in AWS. Every few weeks, or sometimes months, the job fails with network
errors like the following in the logs. This is with Flink 1.14.5.

Is there anything I can do to help my application automatically retry
and recover from this type of error? Do newer versions of Flink handle
this issue any better?

org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
readAddress(..) failed: Connection reset by peer
19:13:57.893 [Flink Netty Server (0) Thread 0] ERROR
org.apache.flink.runtime.io.network.netty.PartitionRequestQueue -
Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
readAddress(..) failed: Connection reset by peer
19:13:57.894 [Flink Netty Server (0) Thread 0] ERROR
org.apache.flink.runtime.io.network.netty.PartitionRequestQueue -
Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException:
readAddress(..) failed: Connection reset by peer
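For reference, one knob on the Flink side is the restart strategy, which controls whether the job retries automatically after a failure like this instead of staying failed. A minimal flink-conf.yaml sketch (these keys exist in Flink 1.14; the values are illustrative, not a recommendation):

```yaml
# Restart the job up to 10 times, waiting 30 s between attempts,
# rather than failing permanently on the first network error.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 30 s
```

Note that when checkpointing is enabled, Flink already defaults to a fixed-delay strategy with a very large attempt count, so a job that stays failed usually points at the restart strategy having been overridden or the failure being non-recoverable.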

I see several similar questions on Stack Overflow with no helpful
answers.

Thank you for any help.


Re: Encountered error while consuming partitions

2020-02-13 Thread Piotr Nowojski
Hi 刘建刚,

Could you explain how you fixed the problem in your case? Did you modify
the Flink code to use `IdleStateHandler`?

Piotrek

> On 13 Feb 2020, at 11:10, 刘建刚  wrote:
> 
> Thanks for all the help. Following the advice, I have fixed the problem.
> 



Re: Encountered error while consuming partitions

2020-02-13 Thread 刘建刚
Thanks for all the help. Following the advice, I have fixed the problem.




Re: Encountered error while consuming partitions

2020-02-13 Thread Zhijiang
Thanks for reporting this issue; I also agree with the analysis below.
We actually encountered the same issue several years ago and likewise solved it
via the Netty idle handler.

Let's track it via the ticket [1] as the next step.

[1] https://issues.apache.org/jira/browse/FLINK-16030

Best,
Zhijiang


--
From: 张光辉
Send time: Wed, 12 Feb 2020, 22:19
To: Benchao Li
Cc: 刘建刚; user
Subject: Re: Encountered error while consuming partitions

Networks can fail in many ways, some quite subtle (e.g. a high rate of packet
loss).

The problem is that the long-lived TCP connection between the Netty client and
server is lost: the server fails to send a message to the client and shuts down
the channel, but the Netty client does not know the connection has been
dropped, so it keeps waiting.

There are two ways to detect whether a long-lived TCP connection between the
Netty client and server is still alive: TCP keepalives and heartbeats.
The TCP keepalive timeout is 2 hours by default. When this error occurs, if you
simply wait for 2 hours, the Netty client will eventually raise an exception and
enter failover recovery.
If you want to detect a dead connection quickly, Netty provides
IdleStateHandler, which uses a ping-pong mechanism: if the Netty client sends
n consecutive ping messages and receives no pong in response, it raises an
exception.
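
The ping-pong logic described above can be sketched in plain Java. This is an illustrative model only, not Flink's or Netty's actual implementation: in Netty, IdleStateHandler fires writer-idle events that would drive something like onPingSent below, and the class and method names here are hypothetical.

```java
// Models the heartbeat idea: if N pings in a row go unanswered,
// the connection is declared dead instead of waiting on TCP keepalive.
public class HeartbeatMonitor {
    private final int maxMissedPongs;
    private int missedPongs = 0;
    private boolean connectionDead = false;

    public HeartbeatMonitor(int maxMissedPongs) {
        this.maxMissedPongs = maxMissedPongs;
    }

    /** Called on each idle timeout: a ping goes out, one more reply is pending. */
    public void onPingSent() {
        missedPongs++;
        if (missedPongs >= maxMissedPongs) {
            // In Netty you would fire an exception event / close the channel here.
            connectionDead = true;
        }
    }

    /** Called whenever a pong arrives: the peer is alive, reset the counter. */
    public void onPongReceived() {
        missedPongs = 0;
    }

    public boolean isConnectionDead() {
        return connectionDead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(3);
        monitor.onPingSent();
        monitor.onPongReceived(); // healthy peer answers, counter resets
        monitor.onPingSent();
        monitor.onPingSent();
        monitor.onPingSent();     // three pings in a row, no pong
        System.out.println(monitor.isConnectionDead()); // true
    }
}
```

A pong resets the counter, so only *consecutive* missed replies trigger failover; a slow but live peer is not killed by a single delayed response.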