Great, thanks a lot Weike. I think the first step would be to open a JIRA issue and get it assigned to you, and then start fixing it and open a PR.
Cheers,
Till

On Fri, Oct 16, 2020 at 10:02 AM DONG, Weike <kyled...@connect.hku.hk> wrote:

> Hi all,
>
> Thanks for all the replies. I agree with Yang: we have found that for a
> pod without a service (like a TaskManager pod), the reverse DNS lookup
> always fails, so this lookup is not necessary in a Kubernetes environment.
>
> I am glad to help fix this issue to make Flink better : )
>
> Best,
> Weike
>
> On Thu, Oct 15, 2020 at 7:57 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Weike,
>>
>> thanks for getting back to us with your findings. Looking at the
>> `TaskManagerLocation`, we are actually calling
>> `InetAddress.getCanonicalHostName` twice for every creation of a
>> `TaskManagerLocation` instance. This does not look right.
>>
>> I think it should be fine to make the lookup configurable. Moreover, one
>> could think about only doing a lazy lookup when the canonical hostname
>> is really needed (as far as I can see, it is only really needed for
>> input split assignments and for the
>> LocationPreferenceSlotSelectionStrategy to calculate how many TMs run
>> on the same machine).
>>
>> Do you want to fix this issue?
>>
>> Cheers,
>> Till
>>
>> On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>
>>> Hi Till and community,
>>>
>>> By the way, initially I resolved the IPs several times and the results
>>> came back rather quickly (in less than 1 ms, possibly due to a DNS
>>> cache on the server), so I thought it might not be a DNS issue.
>>>
>>> However, after debugging and logging, we found that the lookup time
>>> exhibits high variance, i.e. it normally completes quickly, but
>>> occasionally a slow lookup blocks the thread. So an unstable DNS
>>> server can have a great impact on Flink job startup performance.
>>>
>>> Best,
>>> Weike
>>>
>>> On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>
>>>> Hi Till and community,
>>>>
>>>> Increasing `kubernetes.jobmanager.cpu` in the configuration
>>>> alleviates this issue but does not make it disappear.
>>>>
>>>> After adding DEBUG logs to the internals of *flink-runtime*, we have
>>>> found that the culprit is
>>>>
>>>> inetAddress.getCanonicalHostName()
>>>>
>>>> in
>>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName*
>>>> and
>>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*,
>>>> which can take ~6 seconds to complete, so the Akka dispatcher(s) are
>>>> severely blocked by it.
>>>>
>>>> After commenting out the lookup in these two methods, the issue seems
>>>> to be solved immediately, so I wonder whether Flink could provide a
>>>> configuration parameter to turn off the reverse DNS lookup, as Flink
>>>> jobs seem to run happily without it.
>>>>
>>>> Sincerely,
>>>> Weike
>>>>
>>>> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>
>>>>> Hi Weike,
>>>>>
>>>>> could you try setting kubernetes.jobmanager.cpu: 4 in your
>>>>> flink-conf.yaml? I fear that a single CPU is too low for the
>>>>> JobManager component.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>
>>>>>> Hi Weike,
>>>>>>
>>>>>> thanks for posting the logs. I will take a look at them. My
>>>>>> suspicion would be that there is some operation blocking the
>>>>>> JobMaster's main thread, which causes the registrations from the
>>>>>> TMs to time out. Maybe the logs allow me to validate/falsify this
>>>>>> suspicion.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
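To make the suspicion above concrete: the JobMaster, like other Flink RPC endpoints, processes its messages on a single main thread, so one blocking call delays every message queued behind it. The following is a toy sketch of the effect (plain Java, not Flink code; the 6-second sleep merely stands in for a slow reverse DNS lookup):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Toy model of a single-threaded RPC main thread.
    public class BlockedMainThreadDemo {

        public static void main(String[] args) throws Exception {
            ExecutorService mainThread = Executors.newSingleThreadExecutor();
            long start = System.nanoTime();

            // Stands in for a slow reverse DNS lookup running on the main thread.
            mainThread.execute(() -> sleepQuietly(6_000));

            // Stands in for a TaskManager registration message that arrived
            // while the lookup was in flight: it cannot be processed for
            // ~6 seconds and may run into its registration timeout meanwhile.
            mainThread.execute(() -> System.out.printf(
                    "registration handled after %d ms%n",
                    (System.nanoTime() - start) / 1_000_000));

            mainThread.shutdown();
            mainThread.awaitTermination(10, TimeUnit.SECONDS);
        }

        private static void sleepQuietly(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }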
>>>>>>
>>>>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>>
>>>>>>> Hi community,
>>>>>>>
>>>>>>> I have uploaded the log files of the JobManager and of
>>>>>>> TaskManager-1-1 (one of the 50 TaskManagers), with DEBUG log level
>>>>>>> and the default Flink configuration; they clearly show that the
>>>>>>> TaskManager failed to register with the JobManager after 10 attempts.
>>>>>>>
>>>>>>> Here are the links:
>>>>>>>
>>>>>>> JobManager:
>>>>>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>>>>
>>>>>>> TaskManager-1-1:
>>>>>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>>>>
>>>>>>> Thanks : )
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Weike
>>>>>>>
>>>>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>>>
>>>>>>>> Hi community,
>>>>>>>>
>>>>>>>> Recently we have noticed a strange behavior of Flink jobs in
>>>>>>>> Kubernetes per-job mode: as the parallelism increases, the time it
>>>>>>>> takes for the TaskManagers to register with the *JobManager*
>>>>>>>> becomes abnormally long (for a job with a parallelism of 50, a
>>>>>>>> registration attempt can take 60 ~ 120 seconds or even longer),
>>>>>>>> and usually more than 10 attempts are needed to complete the
>>>>>>>> registration.
>>>>>>>>
>>>>>>>> Because of this, we could not submit a job requiring more than 20
>>>>>>>> slots with the default configuration, as the TaskManager would say:
>>>>>>>>
>>>>>>>>> Registration at JobManager
>>>>>>>>> (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>>>>>> attempt 9 timed out after 25600 ms
>>>>>>>>
>>>>>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0
>>>>>>>>> because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>>
>>>>>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>>>>>>> timed out.
>>>>>>>>
>>>>>>>> To cope with this issue, we had to change the configuration
>>>>>>>> parameters below:
>>>>>>>>
>>>>>>>>> # Prevent "Could not allocate the required slot within slot
>>>>>>>>> # request timeout. Please make sure that the cluster has enough
>>>>>>>>> # resources. Stopping the JobMaster for job"
>>>>>>>>> slot.request.timeout: 500000
>>>>>>>>> # Increase the max timeout of a single attempt
>>>>>>>>> cluster.registration.max-timeout: 300000
>>>>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>>>>> akka.ask.timeout: 10 min
>>>>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>>>>> heartbeat.timeout: 500000
>>>>>>>>
>>>>>>>> However, we acknowledge that this is only a temporary dirty fix,
>>>>>>>> which is not what we want. During TaskManager registration with
>>>>>>>> the JobManager, lots of warning messages show up in the logs:
>>>>>>>>
>>>>>>>>> No hostname could be resolved for the IP address 9.166.0.118,
>>>>>>>>> using IP address as host name. Local input split assignment (such
>>>>>>>>> as for HDFS files) may be impacted.
>>>>>>>>
>>>>>>>> Initially we thought this was probably the cause (the reverse DNS
>>>>>>>> lookup might take a long time); however, we later found that the
>>>>>>>> reverse lookup took less than 1 ms, so maybe it is not because of
>>>>>>>> this.
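For reference, a self-contained way to repeat the measurement described above while bypassing the JDK's per-instance caching might look like the sketch below. The IP is a placeholder taken from the warning message, and JVM- or OS-level DNS caches may still smooth out the numbers:

    import java.net.InetAddress;

    public class ReverseLookupTiming {

        public static void main(String[] args) throws Exception {
            // Placeholder pod IP (9.166.0.118); substitute one of your own.
            byte[] ip = {9, (byte) 166, 0, 118};
            for (int i = 0; i < 10; i++) {
                // getByAddress() builds a fresh InetAddress without any
                // lookup, so each getCanonicalHostName() call below really
                // hits the resolver instead of returning a per-instance
                // cached value.
                InetAddress address = InetAddress.getByAddress(ip);
                long start = System.nanoTime();
                String host = address.getCanonicalHostName();
                System.out.printf("lookup %d: %s (%d ms)%n",
                        i, host, (System.nanoTime() - start) / 1_000_000);
            }
        }
    }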
>>>>>>>>
>>>>>>>> Also, we have checked the GC logs of both the TaskManagers and the
>>>>>>>> JobManager, and they seem perfectly normal, without any signs of
>>>>>>>> pauses. The heartbeats are also processed normally according to
>>>>>>>> the logs.
>>>>>>>>
>>>>>>>> Moreover, the TaskManagers register quickly with the
>>>>>>>> ResourceManager but extremely slowly with the JobManager, so this
>>>>>>>> is not caused by a slow network connection.
>>>>>>>>
>>>>>>>> So we wonder: what could cause the slow registration between the
>>>>>>>> JobManager and the TaskManagers? There are no warning or error
>>>>>>>> messages in the logs (at DEBUG level) other than the "No hostname
>>>>>>>> could be resolved" messages, which is quite weird.
>>>>>>>>
>>>>>>>> Thanks for reading, and we hope to get some insights into this
>>>>>>>> issue : )
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>> Weike
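For completeness, a minimal sketch of the lazy, cached lookup that Till suggests earlier in the thread could look as follows. This is illustrative only, under the assumption that the canonical host name can be deferred until it is actually needed (e.g. for input split assignment); it is not the actual TaskManagerLocation change:

    import java.net.InetAddress;

    // Defer the potentially slow reverse DNS call until the canonical host
    // name is first requested, then cache the result.
    public final class LazyCanonicalHostName {

        private final InetAddress inetAddress;
        private String canonicalHostName; // resolved on first access

        public LazyCanonicalHostName(InetAddress inetAddress) {
            this.inetAddress = inetAddress;
        }

        public synchronized String getCanonicalHostName() {
            if (canonicalHostName == null) {
                // May block on a reverse DNS lookup; doing it lazily keeps
                // the lookup off the critical path of TaskManager
                // registration.
                canonicalHostName = inetAddress.getCanonicalHostName();
            }
            return canonicalHostName;
        }
    }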