Hi Till,

Thank you for the kind reminder, and I have created a JIRA ticket for this issue: https://issues.apache.org/jira/browse/FLINK-19677
Could you please assign it to me? I will try to submit a PR this weekend to fix this : )

Sincerely,
Weike

On Fri, Oct 16, 2020 at 5:54 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Great, thanks a lot Weike. I think the first step would be to open a JIRA issue, get assigned, and then start on fixing it and opening a PR.
>
> Cheers,
> Till
>
> On Fri, Oct 16, 2020 at 10:02 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>
>> Hi all,
>>
>> Thanks for all the replies, and I agree with Yang, as we have found that for a pod without a service (like a TaskManager pod), the reverse DNS lookup would always fail, so this lookup is not necessary in the Kubernetes environment.
>>
>> I am glad to help fix this issue to make Flink better : )
>>
>> Best,
>> Weike
>>
>> On Thu, Oct 15, 2020 at 7:57 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> Hi Weike,
>>>
>>> thanks for getting back to us with your findings. Looking at the `TaskManagerLocation`, we are actually calling `InetAddress.getCanonicalHostName` twice for every creation of a `TaskManagerLocation` instance. This does not look right.
>>>
>>> I think it should be fine to make the lookup configurable. Moreover, one could think about only doing a lazy lookup if the canonical hostname is really needed (as far as I can see, it is only really needed for input split assignments and for the LocationPreferenceSlotSelectionStrategy to calculate how many TMs run on the same machine).
>>>
>>> Do you want to fix this issue?
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>
>>>> Hi Till and community,
>>>>
>>>> By the way, initially I resolved the IPs several times, but the results returned rather quickly (less than 1 ms, possibly due to a DNS cache on the server), so I thought it might not be a DNS issue.
>>>>
>>>> However, after debugging and logging, we found that the lookup time exhibits high variance, i.e. normally it completes quickly, but occasionally some slow results would block the thread. So an unstable DNS server might have a great impact on the performance of Flink job startup.
>>>>
>>>> Best,
>>>> Weike
>>>>
>>>> On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>
>>>>> Hi Till and community,
>>>>>
>>>>> Increasing `kubernetes.jobmanager.cpu` in the configuration alleviates this issue but does not make it disappear.
>>>>>
>>>>> After adding DEBUG logs to the internals of *flink-runtime*, we have found that the culprit is
>>>>>
>>>>> inetAddress.getCanonicalHostName()
>>>>>
>>>>> in *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName* and *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*, which could take ~6 seconds to complete, so the Akka dispatcher(s) are severely blocked by it.
>>>>>
>>>>> By commenting out these lookups in the two methods, this issue seems to be solved immediately, so I wonder if Flink could provide a configuration parameter to turn off the reverse DNS lookup, as it seems that Flink jobs can run happily without it.
>>>>>
>>>>> Sincerely,
>>>>> Weike
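
As a rough sketch of the idea discussed above, a lazy and configurable canonical-hostname lookup could look roughly like the snippet below. This is not the actual Flink implementation; the class name and the boolean flag are made up for illustration only.

    import java.net.InetAddress;

    // Illustrative only: a TaskManagerLocation-like holder that resolves the
    // canonical hostname lazily, and only when reverse lookups are enabled.
    public final class LazyHostNameLocation {

        private final InetAddress inetAddress;
        private final boolean reverseLookupEnabled;  // hypothetical config switch
        private volatile String canonicalHostName;   // cached after the first lookup

        public LazyHostNameLocation(InetAddress inetAddress, boolean reverseLookupEnabled) {
            this.inetAddress = inetAddress;
            this.reverseLookupEnabled = reverseLookupEnabled;
        }

        // Only callers that really need the canonical hostname (e.g. input split
        // assignment) pay for the reverse DNS lookup, and they pay at most once.
        public String getCanonicalHostName() {
            String hostName = canonicalHostName;
            if (hostName == null) {
                hostName = reverseLookupEnabled
                        ? inetAddress.getCanonicalHostName()  // may block on a slow DNS server
                        : inetAddress.getHostAddress();       // fall back to the plain IP
                canonicalHostName = hostName;
            }
            return hostName;
        }
    }

With such a shape, the potentially slow InetAddress.getCanonicalHostName() call stays off the registration path and can be skipped entirely in environments, such as Kubernetes pods without a Service, where reverse DNS is known to fail.
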
>>>>> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>
>>>>>> Hi Weike,
>>>>>>
>>>>>> could you try setting kubernetes.jobmanager.cpu: 4 in your flink-conf.yaml? I fear that a single CPU is too low for the JobManager component.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Weike,
>>>>>>>
>>>>>>> thanks for posting the logs. I will take a look at them. My suspicion would be that there is some operation blocking the JobMaster's main thread which causes the registrations from the TMs to time out. Maybe the logs allow me to validate/falsify this suspicion.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>>>
>>>>>>>> Hi community,
>>>>>>>>
>>>>>>>> I have uploaded the log files of the JobManager and TaskManager-1-1 (one of the 50 TaskManagers), with DEBUG log level and the default Flink configuration, and they clearly show that the TaskManager failed to register with the JobManager after 10 attempts.
>>>>>>>>
>>>>>>>> Here are the links:
>>>>>>>>
>>>>>>>> JobManager: https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>>>>>
>>>>>>>> TaskManager-1-1: https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>>>>>
>>>>>>>> Thanks : )
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Weike
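
To make the suspicion about a blocked main thread more concrete, here is a small stand-alone toy program (not Flink code) in which a single slow, blocking call on a single-threaded executor holds up the next queued task, just as a slow reverse DNS lookup on the JobMaster's main thread would delay TaskManager registrations. The 6-second sleep is only a stand-in for such a lookup.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class MainThreadBlockingDemo {

        public static void main(String[] args) throws InterruptedException {
            // Single-threaded executor standing in for an actor-style main thread.
            ExecutorService mainThread = Executors.newSingleThreadExecutor();

            // A blocking call (here: a simulated ~6 s reverse DNS lookup) occupies the thread.
            mainThread.execute(() -> sleepQuietly(6_000));

            // A "registration" submitted right afterwards cannot even start until the
            // blocking call finishes, so it may easily exceed its timeout.
            long enqueuedAt = System.currentTimeMillis();
            mainThread.execute(() -> System.out.println(
                    "registration handled after " + (System.currentTimeMillis() - enqueuedAt) + " ms"));

            mainThread.shutdown();
            mainThread.awaitTermination(10, TimeUnit.SECONDS);
        }

        private static void sleepQuietly(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
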
>>>>>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>>>>
>>>>>>>>> Hi community,
>>>>>>>>>
>>>>>>>>> Recently we have noticed a strange behavior of Flink jobs in Kubernetes per-job mode: when the parallelism increases, the time it takes for the TaskManagers to register with the *JobManager* becomes abnormally long (for a task with a parallelism of 50, it could take 60 ~ 120 seconds or even longer per registration attempt), and usually more than 10 attempts are needed to finish this registration.
>>>>>>>>>
>>>>>>>>> Because of this, we could not submit a job requiring more than 20 slots with the default configuration, as the TaskManager would say:
>>>>>>>>>
>>>>>>>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2) attempt 9 timed out after 25600 ms
>>>>>>>>>
>>>>>>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>>>
>>>>>>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)}, allocationId: 60d5277e138a94fb73fc6691557001e0, jobId: 493cd86e389ccc8f2887e1222903b5ce).
>>>>>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>>>
>>>>>>>>> In order to cope with this issue, we had to change the configuration parameters below:
>>>>>>>>>
>>>>>>>>>> # Prevent "Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources. Stopping the JobMaster for job"
>>>>>>>>>> slot.request.timeout: 500000
>>>>>>>>>> # Increase the max timeout for a single attempt
>>>>>>>>>> cluster.registration.max-timeout: 300000
>>>>>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>>>>>> akka.ask.timeout: 10 min
>>>>>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>>>>>> heartbeat.timeout: 500000
>>>>>>>>>
>>>>>>>>> However, we acknowledge that this is only a temporary dirty fix, which is not what we want. It can be seen that during TaskManager registration with the JobManager, lots of warning messages appear in the logs:
>>>>>>>>>
>>>>>>>>>> No hostname could be resolved for the IP address 9.166.0.118, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
>>>>>>>>>
>>>>>>>>> Initially we thought this was probably the cause (a reverse DNS lookup might take a long time); however, we later found that the reverse lookup only took less than 1 ms, so maybe it is not because of this.
>>>>>>>>>
>>>>>>>>> Also, we have checked the GC logs of both the TaskManagers and the JobManager, and they seem to be perfectly normal, without any signs of pauses. The heartbeats are also processed normally according to the logs.
>>>>>>>>>
>>>>>>>>> Moreover, the TaskManagers register quickly with the ResourceManager but then extremely slowly with the JobManager, so this is not because of a slow network connection.
>>>>>>>>>
>>>>>>>>> So we wonder: what could be the cause of the slow registration between the JobManager and the TaskManager(s)? There are no other warning or error messages in the log (DEBUG level) other than the "No hostname could be resolved" messages, which is quite weird.
>>>>>>>>>
>>>>>>>>> Thanks for reading, and we hope to get some insights into this issue : )
>>>>>>>>>
>>>>>>>>> Sincerely,
>>>>>>>>> Weike
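
For reference, the one-off measurement described above (a reverse lookup that appears to finish in under 1 ms) can be reproduced with a small stand-alone probe along the lines of the sketch below. It is not part of Flink; the IP address is simply the example from the warning message and should be replaced with a real TaskManager pod IP, and repeating the lookup helps expose the high variance reported further up in this thread.

    import java.net.InetAddress;

    public class ReverseLookupProbe {

        public static void main(String[] args) throws Exception {
            // Example IP taken from the warning message; replace with a real pod IP.
            String ip = args.length > 0 ? args[0] : "9.166.0.118";
            InetAddress address = InetAddress.getByName(ip); // no lookup yet for an IP literal

            // Repeat the lookup several times: a single fast result can hide the
            // occasional multi-second stall caused by an unstable DNS server.
            for (int i = 0; i < 10; i++) {
                long start = System.nanoTime();
                String canonical = address.getCanonicalHostName(); // triggers the reverse (PTR) lookup
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("attempt " + i + ": '" + canonical + "' in " + elapsedMs + " ms");
            }
        }
    }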