Hi Till,

Thank you for the kind reminder, and I have created a JIRA ticket for this issue: https://issues.apache.org/jira/browse/FLINK-19677
Could you please assign it to me? I will try to submit a PR this weekend to fix this : )

Sincerely,
Weike

On Fri, Oct 16, 2020 at 5:54 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Great, thanks a lot Weike. I think the first step would be to open a JIRA issue, get assigned, and then start on fixing it and opening a PR.
>
> Cheers,
> Till
>
> On Fri, Oct 16, 2020 at 10:02 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>
>> Hi all,
>>
>> Thanks for all the replies, and I agree with Yang, as we have found that for a pod without a service (like a TaskManager pod), the reverse DNS lookup would always fail, so this lookup is not necessary in the Kubernetes environment.
>>
>> I am glad to help fix this issue to make Flink better : )
>>
>> Best,
>> Weike
>>
>> On Thu, Oct 15, 2020 at 7:57 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> Hi Weike,
>>>
>>> thanks for getting back to us with your findings. Looking at the `TaskManagerLocation`, we are actually calling `InetAddress.getCanonicalHostName` twice for every creation of a `TaskManagerLocation` instance. This does not look right.
>>>
>>> I think it should be fine to make the lookup configurable. Moreover, one could think about only doing a lazy lookup if the canonical hostname is really needed (as far as I can see, it is only really needed for input split assignments and for the LocationPreferenceSlotSelectionStrategy to calculate how many TMs run on the same machine).
>>>
>>> Do you want to fix this issue?
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>
>>>> Hi Till and community,
>>>>
>>>> By the way, initially I resolved the IPs several times, but the results returned rather quickly (less than 1 ms, possibly due to a DNS cache on the server), so I thought it might not be a DNS issue.
>>>>
>>>> However, after debugging and logging, we found that the lookup time exhibits high variance, i.e. normally it completes quickly, but occasionally some slow results would block the thread. So an unstable DNS server might have a great impact on the performance of Flink job startup.
>>>>
>>>> Best,
>>>> Weike
>>>>
>>>> On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>
>>>>> Hi Till and community,
>>>>>
>>>>> Increasing `kubernetes.jobmanager.cpu` in the configuration alleviates this issue but does not make it disappear.
>>>>>
>>>>> After adding DEBUG logs to the internals of *flink-runtime*, we have found that the culprit is
>>>>>
>>>>> inetAddress.getCanonicalHostName()
>>>>>
>>>>> in *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName* and *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*, which could take ~6 seconds to complete, so the Akka dispatcher(s) are severely blocked by it.
>>>>>
>>>>> By commenting out these lookups in the two methods, this issue seems to be solved immediately, so I wonder if Flink could provide a configuration parameter to turn off the reverse DNS lookup, as it seems that Flink jobs can run happily without it.
>>>>>
>>>>> Sincerely,
>>>>> Weike
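
As a rough sketch of the idea discussed above, a lazy and configurable canonical-hostname lookup could look roughly like the snippet below. This is not the actual Flink implementation; the class name and the boolean flag are made up for illustration only.

    import java.net.InetAddress;

    // Illustrative only: a TaskManagerLocation-like holder that resolves the
    // canonical hostname lazily, and only when reverse lookups are enabled.
    public final class LazyHostNameLocation {

        private final InetAddress inetAddress;
        private final boolean reverseLookupEnabled;  // hypothetical config switch
        private volatile String canonicalHostName;   // cached after the first lookup

        public LazyHostNameLocation(InetAddress inetAddress, boolean reverseLookupEnabled) {
            this.inetAddress = inetAddress;
            this.reverseLookupEnabled = reverseLookupEnabled;
        }

        // Only callers that really need the canonical hostname (e.g. input split
        // assignment) pay for the reverse DNS lookup, and they pay at most once.
        public String getCanonicalHostName() {
            String hostName = canonicalHostName;
            if (hostName == null) {
                hostName = reverseLookupEnabled
                        ? inetAddress.getCanonicalHostName()  // may block on a slow DNS server
                        : inetAddress.getHostAddress();       // fall back to the plain IP
                canonicalHostName = hostName;
            }
            return hostName;
        }
    }

With such a shape, the potentially slow InetAddress.getCanonicalHostName() call stays off the registration path and can be skipped entirely in environments, such as Kubernetes pods without a Service, where reverse DNS is known to fail.
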
>>>>> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>
>>>>>> Hi Weike,
>>>>>>
>>>>>> could you try setting kubernetes.jobmanager.cpu: 4 in your flink-conf.yaml? I fear that a single CPU is too low for the JobManager component.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Weike,
>>>>>>>
>>>>>>> thanks for posting the logs. I will take a look at them. My suspicion would be that there is some operation blocking the JobMaster's main thread which causes the registrations from the TMs to time out. Maybe the logs allow me to validate/falsify this suspicion.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>>>
>>>>>>>> Hi community,
>>>>>>>>
>>>>>>>> I have uploaded the log files of the JobManager and TaskManager-1-1 (one of the 50 TaskManagers), with DEBUG log level and the default Flink configuration, and they clearly show that the TaskManager failed to register with the JobManager after 10 attempts.
>>>>>>>>
>>>>>>>> Here are the links:
>>>>>>>>
>>>>>>>> JobManager: https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>>>>>
>>>>>>>> TaskManager-1-1: https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>>>>>
>>>>>>>> Thanks : )
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Weike
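
To make the suspicion about a blocked main thread more concrete, here is a small stand-alone toy program (not Flink code) in which a single slow, blocking call on a single-threaded executor holds up the next queued task, just as a slow reverse DNS lookup on the JobMaster's main thread would delay TaskManager registrations. The 6-second sleep is only a stand-in for such a lookup.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class MainThreadBlockingDemo {

        public static void main(String[] args) throws InterruptedException {
            // Single-threaded executor standing in for an actor-style main thread.
            ExecutorService mainThread = Executors.newSingleThreadExecutor();

            // A blocking call (here: a simulated ~6 s reverse DNS lookup) occupies the thread.
            mainThread.execute(() -> sleepQuietly(6_000));

            // A "registration" submitted right afterwards cannot even start until the
            // blocking call finishes, so it may easily exceed its timeout.
            long enqueuedAt = System.currentTimeMillis();
            mainThread.execute(() -> System.out.println(
                    "registration handled after " + (System.currentTimeMillis() - enqueuedAt) + " ms"));

            mainThread.shutdown();
            mainThread.awaitTermination(10, TimeUnit.SECONDS);
        }

        private static void sleepQuietly(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
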
>>>>>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>>>>
>>>>>>>>> Hi community,
>>>>>>>>>
>>>>>>>>> Recently we have noticed a strange behavior of Flink jobs in Kubernetes per-job mode: when the parallelism increases, the time it takes for the TaskManagers to register with the *JobManager* becomes abnormally long (for a task with a parallelism of 50, it could take 60 ~ 120 seconds or even longer per registration attempt), and usually more than 10 attempts are needed to finish this registration.
>>>>>>>>>
>>>>>>>>> Because of this, we could not submit a job requiring more than 20 slots with the default configuration, as the TaskManager would say:
>>>>>>>>>
>>>>>>>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2) attempt 9 timed out after 25600 ms
>>>>>>>>>
>>>>>>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>>>
>>>>>>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)}, allocationId: 60d5277e138a94fb73fc6691557001e0, jobId: 493cd86e389ccc8f2887e1222903b5ce).
>>>>>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>>>
>>>>>>>>> In order to cope with this issue, we had to change the configuration parameters below:
>>>>>>>>>
>>>>>>>>>> # Prevent "Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources. Stopping the JobMaster for job"
>>>>>>>>>> slot.request.timeout: 500000
>>>>>>>>>> # Increase the max timeout for a single attempt
>>>>>>>>>> cluster.registration.max-timeout: 300000
>>>>>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>>>>>> akka.ask.timeout: 10 min
>>>>>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>>>>>> heartbeat.timeout: 500000
>>>>>>>>>
>>>>>>>>> However, we acknowledge that this is only a temporary dirty fix, which is not what we want. It can be seen that during TaskManager registration with the JobManager, lots of warning messages appear in the logs:
>>>>>>>>>
>>>>>>>>>> No hostname could be resolved for the IP address 9.166.0.118, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
>>>>>>>>>
>>>>>>>>> Initially we thought this was probably the cause (a reverse DNS lookup might take a long time); however, we later found that the reverse lookup only took less than 1 ms, so maybe it is not because of this.
>>>>>>>>>
>>>>>>>>> Also, we have checked the GC logs of both the TaskManagers and the JobManager, and they seem to be perfectly normal, without any signs of pauses. The heartbeats are also processed normally according to the logs.
>>>>>>>>>
>>>>>>>>> Moreover, the TaskManagers register quickly with the ResourceManager but then extremely slowly with the JobManager, so this is not because of a slow network connection.
>>>>>>>>>
>>>>>>>>> So we wonder: what could be the cause of the slow registration between the JobManager and the TaskManager(s)? There are no other warning or error messages in the log (DEBUG level) other than the "No hostname could be resolved" messages, which is quite weird.
>>>>>>>>>
>>>>>>>>> Thanks for reading, and we hope to get some insights into this issue : )
>>>>>>>>>
>>>>>>>>> Sincerely,
>>>>>>>>> Weike
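
For reference, the one-off measurement described above (a reverse lookup that appears to finish in under 1 ms) can be reproduced with a small stand-alone probe along the lines of the sketch below. It is not part of Flink; the IP address is simply the example from the warning message and should be replaced with a real TaskManager pod IP, and repeating the lookup helps expose the high variance reported further up in this thread.

    import java.net.InetAddress;

    public class ReverseLookupProbe {

        public static void main(String[] args) throws Exception {
            // Example IP taken from the warning message; replace with a real pod IP.
            String ip = args.length > 0 ? args[0] : "9.166.0.118";
            InetAddress address = InetAddress.getByName(ip); // no lookup yet for an IP literal

            // Repeat the lookup several times: a single fast result can hide the
            // occasional multi-second stall caused by an unstable DNS server.
            for (int i = 0; i < 10; i++) {
                long start = System.nanoTime();
                String canonical = address.getCanonicalHostName(); // triggers the reverse (PTR) lookup
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("attempt " + i + ": '" + canonical + "' in " + elapsedMs + " ms");
            }
        }
    }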