Hi Till and community,

Increasing `kubernetes.jobmanager.cpu` in the configuration alleviates this issue but does not make it disappear.

After adding DEBUG logs to the internals of *flink-runtime*, we have found that the culprit is inetAddress.getCanonicalHostName() in *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName* and *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*, which can take ~6 seconds to complete and thereby severely blocks the Akka dispatcher(s). By commenting out the reverse lookup in these two methods, the issue seems to be resolved immediately, so I wonder whether Flink could provide a configuration parameter to turn off the DNS reverse lookup, as Flink jobs seem to run happily without it.
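To make the suggestion concrete, below is a minimal, self-contained sketch (not actual Flink code; the class name and the flag are made up for illustration) of what such a switch could look like: when reverse lookups are disabled, the plain IP address is used as the host name, which is what the existing "No hostname could be resolved" fallback already does today.

import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch only -- not Flink code. The flag stands in for a hypothetical
// configuration option that would let TaskManagerLocation skip the
// reverse DNS lookup entirely.
public final class HostNameResolutionSketch {

    private final boolean resolveHostNames; // hypothetical flink-conf.yaml switch

    public HostNameResolutionSketch(boolean resolveHostNames) {
        this.resolveHostNames = resolveHostNames;
    }

    public String hostNameFor(InetAddress address) {
        if (!resolveHostNames) {
            // Fast path: no DNS traffic at all, just the textual IP address,
            // like the current "using IP address as host name" fallback.
            return address.getHostAddress();
        }
        // Slow path: may block on a reverse DNS query -- this is the call
        // that took ~6 seconds per address in our cluster and stalled the
        // Akka dispatcher thread.
        return address.getCanonicalHostName();
    }

    public static void main(String[] args) throws UnknownHostException {
        InetAddress addr =
                InetAddress.getByName(args.length > 0 ? args[0] : "9.166.0.118");
        System.out.println(new HostNameResolutionSketch(false).hostNameFor(addr));
        System.out.println(new HostNameResolutionSketch(true).hostNameFor(addr));
    }
}

Defaulting such a flag to the current behavior would keep existing deployments unaffected, while Kubernetes users whose DNS cannot resolve pod IPs could opt out of the lookup.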
Sincerely,
Weike

On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Weike,
>
> could you try setting kubernetes.jobmanager.cpu: 4 in your
> flink-conf.yaml? I fear that a single CPU is too low for the JobManager
> component.
>
> Cheers,
> Till
>
> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Hi Weike,
>>
>> thanks for posting the logs. I will take a look at them. My suspicion
>> would be that there is some operation blocking the JobMaster's main
>> thread which causes the registrations from the TMs to time out. Maybe
>> the logs allow me to validate/falsify this suspicion.
>>
>> Cheers,
>> Till
>>
>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <kyled...@connect.hku.hk>
>> wrote:
>>
>>> Hi community,
>>>
>>> I have uploaded the log files of the JobManager and TaskManager-1-1
>>> (one of the 50 TaskManagers) with DEBUG log level and the default
>>> Flink configuration, and they clearly show that the TaskManager failed
>>> to register with the JobManager after 10 attempts.
>>>
>>> Here are the links:
>>>
>>> JobManager:
>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>
>>> TaskManager-1-1:
>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>
>>> Thanks : )
>>>
>>> Best regards,
>>> Weike
>>>
>>>
>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <kyled...@connect.hku.hk>
>>> wrote:
>>>
>>>> Hi community,
>>>>
>>>> Recently we have noticed a strange behavior of Flink jobs on
>>>> Kubernetes in per-job mode: as the parallelism increases, the time it
>>>> takes for the TaskManagers to register with the *JobManager* becomes
>>>> abnormally long (for a job with a parallelism of 50, a single
>>>> registration attempt can take 60 ~ 120 seconds or even longer), and
>>>> usually more than 10 attempts are needed to complete the registration.
>>>>
>>>> Because of this, we could not submit a job requiring more than 20
>>>> slots with the default configuration, as the TaskManager would say:
>>>>
>>>>> Registration at JobManager
>>>>> (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>> attempt 9 timed out after 25600 ms
>>>>
>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because:
>>>>> The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>
>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>>> timed out.
>>>>
>>>> In order to cope with this issue, we have to change the configuration
>>>> parameters below:
>>>>
>>>>> # Prevent "Could not allocate the required slot within slot request
>>>>> # timeout. Please make sure that the cluster has enough resources.
>>>>> # Stopping the JobMaster for job"
>>>>> slot.request.timeout: 500000
>>>>
>>>>> # Increase the maximum timeout of a single registration attempt
>>>>> cluster.registration.max-timeout: 300000
>>>>> # Prevent "free slot (TaskSlot)"
>>>>> akka.ask.timeout: 10 min
>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>> heartbeat.timeout: 500000
>>>>
>>>> However, we acknowledge that this is only a temporary, dirty fix, which
>>>> is not what we want. During TaskManager registration with the
>>>> JobManager, lots of warning messages show up in the logs:
>>>>
>>>>> No hostname could be resolved for the IP address 9.166.0.118, using IP
>>>>> address as host name. Local input split assignment (such as for HDFS
>>>>> files) may be impacted.
>>>>
>>>> Initially we thought this was probably the cause (a reverse DNS lookup
>>>> might take a long time), but we later found that the reverse lookup
>>>> only took less than 1 ms, so it is probably not the reason.
>>>>
>>>> Also, we have checked the GC logs of both the TaskManagers and the
>>>> JobManager, and they look perfectly normal, without any signs of
>>>> pauses. The heartbeats are also processed normally according to the
>>>> logs.
>>>>
>>>> Moreover, the TaskManagers register quickly with the ResourceManager
>>>> but extremely slowly with the JobManager, so this is not caused by a
>>>> slow network connection.
>>>>
>>>> So we wonder: what could be the cause of the slow registration between
>>>> the JobManager and the TaskManagers? There are no warning or error
>>>> messages in the logs (DEBUG level) other than the "No hostname could be
>>>> resolved" messages, which is quite weird.
>>>>
>>>> Thanks for reading, and we hope to get some insights into this issue
>>>> : )
>>>>
>>>> Sincerely,
>>>> Weike
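PS: for anyone who wants to check whether reverse lookups are slow in their own environment, a tiny standalone probe along the following lines (our own helper, not part of Flink) can be run inside a TaskManager pod; it times the same InetAddress.getCanonicalHostName() call used in TaskManagerLocation:

import java.net.InetAddress;

// Standalone probe (not Flink code): times a reverse DNS lookup for each IP
// address passed on the command line, e.g. `java ReverseLookupProbe 9.166.0.118`.
public final class ReverseLookupProbe {
    public static void main(String[] args) throws Exception {
        for (String ip : args) {
            InetAddress address = InetAddress.getByName(ip);
            long start = System.nanoTime();
            String name = address.getCanonicalHostName(); // may block on DNS
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(ip + " -> " + name + " (" + elapsedMs + " ms)");
        }
    }
}

If the probe reports multi-second lookups for pod IPs, the blocked-dispatcher explanation above fits; if it stays in the low milliseconds, the bottleneck is likely elsewhere.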