Hi Zhilong,

Thanks a lot for your very detailed answer!
My setup: Flink 1.14.0 on YARN, jdk1.8_u202.

The timeout happens at the job deployment stage. I checked the GC logs, and both the JM and the TM look fine, but the JM's CPU usage can spike to 2000% for a short time (cgroups are not turned on).

I’ve set the Akka timeout to 60s as a workaround, and the job now runs well. I’m planning to dig deeper with a profiler and will get back to the community if I find something.

Best,
Paul Lam

> On Jan 20, 2022, at 20:09, Zhilong Hong <zhlongh...@gmail.com> wrote:
>
> Hi, Paul:
>
> Increasing akka.ask.timeout only covers up the issue. Maybe you could try to
> find the root cause of the akka timeout.
>
> There are many reasons that could lead to an akka timeout:
>
> 1. JM/TM cannot respond in time. Maybe JM/TM is busy with GC. You could
> analyze the GC behaviour according to the documentation [1]. If long GC
> pauses happen during runtime, you could try to increase the heap memory or
> increase the parallelism.
>
> Maybe there is a machine that has a high CPU load. If the CPU load is too
> high, the main thread may not be able to process the akka messages in time.
> You could monitor the CPU usage during the runtime of your jobs with commands
> like top.
>
> I'm wondering what version of Flink you are using? And when did the akka
> timeout happen? If it happened during deployment, you could try to
> upgrade your Flink to 1.14 for better deployment performance.
>
> 2. Network congestion. If the network in your cluster is in bad shape, the
> akka message cannot arrive at its destination in time, and an akka
> timeout happens. You could monitor the network traffic with the tools
> mentioned in [2].
>
> If you are trying to increase the value of akka.ask.timeout, you could
> increase it by 10 seconds each time and see whether it works.
>
> Sincerely,
> Zhilong
>
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/application_profiling/#analyzing-memory--garbage-collection-behaviour
> [2] https://askubuntu.com/questions/257263/how-to-display-network-traffic-in-the-terminal
>
> On Thu, Jan 20, 2022 at 4:45 PM Paul Lam <paullin3...@gmail.com> wrote:
> Hi,
>
> I’m tuning a Flink job with 1000+ parallelism, which frequently fails with
> Akka TimeOutException (it was fine with 200 parallelism).
>
> I see some posts recommend increasing `akka.ask.timeout` to 120s. I’m not
> familiar with Akka but it looks like a very long time compared to the default
> 10s and as a response timeout.
>
> So I’m wondering what’s the reasonable range for this option? And why would
> the Actor fail to respond in time (the message was dropped due to pressure)?
>
> Any input would be appreciated! Thanks a lot.
>
> Best,
> Paul Lam
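
For reference, the 60s workaround discussed in this thread would typically be expressed as a single entry in the cluster configuration. This is only a minimal sketch, assuming a Flink 1.14 deployment configured through `flink-conf.yaml`; the file location and the comment text are illustrative assumptions, not taken from the thread.

```yaml
# flink-conf.yaml (assumed location: the conf/ directory of the Flink distribution)
# Raise the Akka ask timeout from the 10s default mentioned in the thread to 60s.
# As noted in the thread, this only masks the underlying cause of the timeout.
akka.ask.timeout: 60 s
```

Following the advice in the thread, the value could be raised in 10-second steps rather than jumping straight to a large timeout.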