I am afraid the InetAddress cache does not take effect here, because Kubernetes only creates A and SRV records for Services. It does not generate A records for the Pods as you might expect; refer to [1][2] for more information. So the DNS reverse lookup will always fail. IIRC, the default lookup timeout is 5s, which could explain the delay in "getHostName" and "getFqdnHostName".
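To make it easy to verify, below is a minimal, self-contained sketch (plain Java, nothing Flink-specific) that measures a single reverse lookup. The IP address is only an example taken from the logs further down in this thread; any Pod IP without a PTR record should show a similar delay:

import java.net.InetAddress;

public class ReverseLookupProbe {

    public static void main(String[] args) throws Exception {
        // A Pod IP without a matching PTR record, e.g. taken from the TaskManager logs.
        String ip = args.length > 0 ? args[0] : "9.166.0.118";
        InetAddress address = InetAddress.getByName(ip);

        long start = System.nanoTime();
        // Triggers the reverse DNS lookup; on failure it falls back to the IP literal.
        String canonical = address.getCanonicalHostName();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("canonical host name: " + canonical + " (took " + elapsedMs + " ms)");
    }
}

On a cluster without Pod DNS records this should print the IP literal back only after the lookup has timed out, which would match the delay observed in "getHostName" and "getFqdnHostName".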
I agree that we should add a config option to disable the DNS reverse lookup. To make the proposal a bit more concrete, I have appended a rough sketch of the idea at the very end of this mail, below the quoted thread.

[1] https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#coredns-configmap-options
[2] https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#a-aaaa-records-1

Best,
Yang

Chesnay Schepler <ches...@apache.org> wrote on Thu, Oct 15, 2020 at 8:41 PM:

> The InetAddress caches the result of getCanonicalHostName(), so it is not a problem to call it twice.
>
> On 10/15/2020 1:57 PM, Till Rohrmann wrote:
>
> Hi Weike,
>
> thanks for getting back to us with your findings. Looking at the `TaskManagerLocation`, we are actually calling `InetAddress.getCanonicalHostName` twice for every creation of a `TaskManagerLocation` instance. This does not look right.
>
> I think it should be fine to make the lookup configurable. Moreover, one could think about only doing a lazy lookup if the canonical hostname is really needed (as far as I can see, it is only really needed for input split assignments and for the LocationPreferenceSlotSelectionStrategy to calculate how many TMs run on the same machine).
>
> Do you want to fix this issue?
>
> Cheers,
> Till
>
> On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>
>> Hi Till and community,
>>
>> By the way, initially I resolved the IPs several times and the results came back rather quickly (less than 1 ms, possibly due to a DNS cache on the server), so I thought it might not be a DNS issue.
>>
>> However, after debugging and logging, we found that the lookup time exhibited high variance, i.e. it normally completes fast, but occasionally a slow result blocks the thread. So an unstable DNS server might have a great impact on the startup performance of a Flink job.
>>
>> Best,
>> Weike
>>
>> On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>
>>> Hi Till and community,
>>>
>>> Increasing `kubernetes.jobmanager.cpu` in the configuration alleviates this issue but does not make it disappear.
>>>
>>> After adding DEBUG logs to the internals of *flink-runtime*, we have found that the culprit is
>>>
>>> inetAddress.getCanonicalHostName()
>>>
>>> in *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName* and *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*, which could take ~6 seconds to complete, so the Akka dispatcher(s) are severely blocked by it.
>>>
>>> By commenting out the two methods, this issue seems to be solved immediately, so I wonder if Flink could provide a configuration parameter to turn off the DNS reverse lookup, as it seems that Flink jobs can run happily without it.
>>>
>>> Sincerely,
>>> Weike
>>>
>>> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>
>>>> Hi Weike,
>>>>
>>>> could you try setting kubernetes.jobmanager.cpu: 4 in your flink-conf.yaml? I fear that a single CPU is too low for the JobManager component.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>
>>>>> Hi Weike,
>>>>>
>>>>> thanks for posting the logs. I will take a look at them. My suspicion would be that there is some operation blocking the JobMaster's main thread, which causes the registrations from the TMs to time out. Maybe the logs will allow me to validate/falsify this suspicion.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>
>>>>>> Hi community,
>>>>>>
>>>>>> I have uploaded the log files of the JobManager and TaskManager-1-1 (one of the 50 TaskManagers) with DEBUG log level and the default Flink configuration, and they clearly show that the TaskManager failed to register with the JobManager after 10 attempts.
>>>>>>
>>>>>> Here are the links:
>>>>>>
>>>>>> JobManager: https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>>>
>>>>>> TaskManager-1-1: https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>>>
>>>>>> Thanks : )
>>>>>>
>>>>>> Best regards,
>>>>>> Weike
>>>>>>
>>>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>>
>>>>>>> Hi community,
>>>>>>>
>>>>>>> Recently we have noticed a strange behavior for Flink jobs in Kubernetes per-job mode: when the parallelism increases, the time it takes for the TaskManagers to register with the *JobManager* becomes abnormally long (for a job with a parallelism of 50, a registration attempt could take 60 ~ 120 seconds or even longer), and usually more than 10 attempts are needed to finish the registration.
>>>>>>>
>>>>>>> Because of this, we could not submit a job requiring more than 20 slots with the default configuration, as the TaskManager would say:
>>>>>>>
>>>>>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2) attempt 9 timed out after 25600 ms
>>>>>>>>
>>>>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>>
>>>>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)}, allocationId: 60d5277e138a94fb73fc6691557001e0, jobId: 493cd86e389ccc8f2887e1222903b5ce).
>>>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>
>>>>>>> In order to cope with this issue, we have to change the configuration parameters below:
>>>>>>>
>>>>>>>> # Prevent "Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources. Stopping the JobMaster for job"
>>>>>>>> slot.request.timeout: 500000
>>>>>>>>
>>>>>>>> # Increase the max timeout of a single attempt
>>>>>>>> cluster.registration.max-timeout: 300000
>>>>>>>>
>>>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>>>> akka.ask.timeout: 10 min
>>>>>>>>
>>>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>>>> heartbeat.timeout: 500000
>>>>>>>
>>>>>>> However, we acknowledge that this is only a temporary dirty fix, which is not what we want. It can be seen that during TaskManager registration with the JobManager, lots of warning messages appear in the logs:
>>>>>>>
>>>>>>>> No hostname could be resolved for the IP address 9.166.0.118, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
>>>>>>>
>>>>>>> Initially we thought this was probably the cause (the reverse DNS lookup might take a long time), however we later found that the reverse lookup only took less than 1 ms, so maybe it is not because of this.
>>>>>>>
>>>>>>> Also, we have checked the GC logs of both the TaskManagers and the JobManager, and they seem to be perfectly normal, without any signs of pauses. And the heartbeats are processed as normal according to the logs.
>>>>>>>
>>>>>>> Moreover, the TaskManagers register quickly with the ResourceManager but extremely slowly with the JobManager, so this is not because of a slow network connection.
>>>>>>>
>>>>>>> Here we wonder what could be the cause of the slow registration between the JobManager and the TaskManager(s)? There are no other warning or error messages in the logs (DEBUG level) other than the "No hostname could be resolved" messages, which is quite weird.
>>>>>>>
>>>>>>> Thanks for reading, and we hope to get some insights into this issue : )
>>>>>>>
>>>>>>> Sincerely,
>>>>>>> Weike
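As mentioned at the top of this mail, here is a rough, illustrative sketch of the lazy/optional reverse lookup idea. It is not the actual TaskManagerLocation code; the class name and the boolean flag are made up for illustration, and how the flag would be exposed in flink-conf.yaml is left open:

import java.net.InetAddress;

// Illustrative sketch only, not the actual Flink implementation.
public final class LazyHostNames {

    private final InetAddress address;
    private final boolean reverseLookupEnabled;

    // Cached so that the (potentially slow) lookup happens at most once per instance.
    private String hostName;

    public LazyHostNames(InetAddress address, boolean reverseLookupEnabled) {
        this.address = address;
        this.reverseLookupEnabled = reverseLookupEnabled;
    }

    // Returns the canonical host name lazily, or just the IP address if the reverse lookup is disabled.
    public synchronized String getHostName() {
        if (hostName == null) {
            hostName = reverseLookupEnabled
                    ? address.getCanonicalHostName() // may block for seconds if no PTR record exists
                    : address.getHostAddress();      // cheap, no DNS involved
        }
        return hostName;
    }
}

With something along these lines, only the code paths that really need the canonical host name (input split assignment and the LocationPreferenceSlotSelectionStrategy) would pay for the lookup, and clusters without reverse DNS could switch it off entirely.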