Hi Zhilong,

Thanks a lot for your very detailed answer!

My setup: Flink 1.14.0 on YARN, jdk1.8_u202

The timeout happens at the job deployment stage. I checked the GC logs, and 
both JM and TM look good, but the JM's CPU usage can go up to 2000% for a 
short time (cgroups are not turned on). 

I’ve set the akka timeout to 60s as a workaround, and the job now runs well.  
I’m planning to dig deeper with a profiler and will get back to the community 
if I find something.
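
For reference, the workaround amounts to something like the following in 
flink-conf.yaml. This is only a sketch: it assumes the option involved is 
akka.ask.timeout (the one discussed in this thread), and 60 s is simply the 
value I picked as a workaround, not a recommended setting.

```yaml
# flink-conf.yaml (sketch; assumes akka.ask.timeout is the timeout being hit)
# The default is 10 s; 60 s is a workaround value, not a tuned recommendation.
akka.ask.timeout: 60 s
```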

Best,
Paul Lam

> On Jan 20, 2022, at 20:09, Zhilong Hong <zhlongh...@gmail.com> wrote:
> 
> Hi, Paul:
> 
> Increasing akka.ask.timeout only covers up the issue. Maybe you could try to 
> find the root cause of the akka timeout.
> 
> There are many reasons that could lead to an akka timeout:
> 
> 1. JM/TM cannot respond in time. Maybe JM/TM is busy with GC. You could 
> analyze the GC behavior according to the documentation [1]. If long GC 
> pauses happen during runtime, you could try to increase the heap memory or 
> increase the parallelism. 
> 
> Maybe there exists a machine that has a high CPU load. If the CPU load is too 
> high, the main thread may not be able to process the akka messages in time. 
> You could monitor the CPU usage during the runtime of your jobs with commands 
> like top. 
> 
> Which version of Flink are you using, and when did the akka timeout 
> happen? If it happened during deployment, you could try to upgrade Flink 
> to 1.14 for better deployment performance.
> 
> 2. Network congestion. If the network in your cluster is congested, akka 
> messages cannot arrive at their destination in time, and an akka timeout 
> happens. You could monitor the network traffic with the tools mentioned 
> in [2]. 
> 
> If you are trying to increase the value of akka.ask.timeout, you could 
> increase it by 10 seconds each time and see whether it works.
> 
> Sincerely,
> Zhilong
> 
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/application_profiling/#analyzing-memory--garbage-collection-behaviour
> [2] https://askubuntu.com/questions/257263/how-to-display-network-traffic-in-the-terminal
> 
> On Thu, Jan 20, 2022 at 4:45 PM Paul Lam <paullin3...@gmail.com> wrote:
> Hi,
> 
> I’m tuning a Flink job with 1000+ parallelism, which frequently fails with 
> Akka TimeOutException (it was fine with 200 parallelism). 
> 
> I see some posts recommend increasing `akka.ask.timeout` to 120s. I’m not 
> familiar with Akka, but that looks like a very long time for a response 
> timeout, compared to the default of 10s.
> 
> So I’m wondering what a reasonable range for this option is. And why would 
> the actor fail to respond in time (was the message dropped due to pressure)?
> 
> Any input would be appreciated! Thanks a lot.
> 
> Best,
> Paul Lam
> 
