Done, you are now assigned, Weike.

Cheers,
Till
On Fri, Oct 16, 2020 at 1:33 PM DONG, Weike <kyled...@connect.hku.hk> wrote:

> Hi Till,
>
> Thank you for the kind reminder. I have created a JIRA ticket for this
> issue: https://issues.apache.org/jira/browse/FLINK-19677
>
> Could you please assign it to me? I will try to submit a PR this weekend
> to fix this : )
>
> Sincerely,
> Weike
>
> On Fri, Oct 16, 2020 at 5:54 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Great, thanks a lot Weike. I think the first step would be to open a JIRA
>> issue and get assigned, and then start fixing it and opening a PR.
>>
>> Cheers,
>> Till
>>
>> On Fri, Oct 16, 2020 at 10:02 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>
>>> Hi all,
>>>
>>> Thanks for all the replies, and I agree with Yang: we have found that
>>> for a pod without a service (like a TaskManager pod), the reverse DNS
>>> lookup always fails, so this lookup is not necessary in the Kubernetes
>>> environment.
>>>
>>> I am glad to help fix this issue to make Flink better : )
>>>
>>> Best,
>>> Weike
>>>
>>> On Thu, Oct 15, 2020 at 7:57 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>
>>>> Hi Weike,
>>>>
>>>> thanks for getting back to us with your findings. Looking at the
>>>> `TaskManagerLocation`, we are actually calling
>>>> `InetAddress.getCanonicalHostName` twice for every creation of a
>>>> `TaskManagerLocation` instance. This does not look right.
>>>>
>>>> I think it should be fine to make the lookup configurable. Moreover,
>>>> one could think about only doing a lazy lookup if the canonical hostname
>>>> is really needed (as far as I can see, it is only needed for input split
>>>> assignments and for the LocationPreferenceSlotSelectionStrategy to
>>>> calculate how many TMs run on the same machine).
>>>>
>>>> Do you want to fix this issue?
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>
>>>>> Hi Till and community,
>>>>>
>>>>> By the way, initially I resolved the IPs several times and the results
>>>>> came back rather quickly (in less than 1 ms, possibly due to the DNS
>>>>> cache on the server), so I thought it might not be a DNS issue.
>>>>>
>>>>> However, after debugging and logging, we found that the lookup time
>>>>> exhibits high variance, i.e. it normally completes quickly, but
>>>>> occasionally a slow result blocks the thread. So an unstable DNS server
>>>>> can have a great impact on the startup performance of a Flink job.
>>>>>
>>>>> Best,
>>>>> Weike
>>>>>
>>>>> On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>
>>>>>> Hi Till and community,
>>>>>>
>>>>>> Increasing `kubernetes.jobmanager.cpu` in the configuration alleviates
>>>>>> this issue but does not make it disappear.
>>>>>>
>>>>>> After adding DEBUG logs to the internals of *flink-runtime*, we have
>>>>>> found that the culprit is
>>>>>>
>>>>>> inetAddress.getCanonicalHostName()
>>>>>>
>>>>>> in *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName*
>>>>>> and *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*,
>>>>>> which can take ~ 6 seconds to complete, so the Akka dispatcher(s) are
>>>>>> severely blocked by it.
>>>>>>
>>>>>> By commenting out the two methods, this issue seems to be solved
>>>>>> immediately, so I wonder if Flink could provide a configuration
>>>>>> parameter to turn off the reverse DNS lookup, as it seems that Flink
>>>>>> jobs can run happily without it.
>>>>>>
>>>>>> Sincerely,
>>>>>> Weike
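
For illustration, here is a minimal sketch of the lazy lookup idea Till
describes, where the potentially slow reverse DNS query only runs when the
canonical host name is actually requested. The class and member names below
are made up for the example and are not Flink's actual TaskManagerLocation
implementation:

    import java.net.InetAddress;

    /**
     * Sketch only: defer the reverse DNS lookup until the canonical host
     * name is first needed, so that creating the location object never
     * blocks on DNS. Names are illustrative, not Flink's real API.
     */
    public final class LazyCanonicalHostName {

        private final InetAddress inetAddress;

        // Cached result of the reverse lookup.
        private volatile String canonicalHostName;

        public LazyCanonicalHostName(InetAddress inetAddress) {
            this.inetAddress = inetAddress;
        }

        public String get() {
            String resolved = canonicalHostName;
            if (resolved == null) {
                // May block for seconds on a slow or unstable DNS server,
                // so callers should avoid this on the RPC main thread.
                resolved = inetAddress.getCanonicalHostName();
                canonicalHostName = resolved;
            }
            return resolved;
        }
    }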
>>>>>>
>>>>>> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Weike,
>>>>>>>
>>>>>>> could you try setting kubernetes.jobmanager.cpu: 4 in your
>>>>>>> flink-conf.yaml? I fear that a single CPU is too low for the
>>>>>>> JobManager component.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi Weike,
>>>>>>>>
>>>>>>>> thanks for posting the logs. I will take a look at them. My
>>>>>>>> suspicion would be that there is some operation blocking the
>>>>>>>> JobMaster's main thread which causes the registrations from the TMs
>>>>>>>> to time out. Maybe the logs allow me to validate/falsify this
>>>>>>>> suspicion.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>>
>>>>>>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>>>>
>>>>>>>>> Hi community,
>>>>>>>>>
>>>>>>>>> I have uploaded the log files of the JobManager and TaskManager-1-1
>>>>>>>>> (one of the 50 TaskManagers) with DEBUG log level and the default
>>>>>>>>> Flink configuration, and they clearly show that the TaskManager
>>>>>>>>> failed to register with the JobManager after 10 attempts.
>>>>>>>>>
>>>>>>>>> Here are the links:
>>>>>>>>>
>>>>>>>>> JobManager:
>>>>>>>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>>>>>>
>>>>>>>>> TaskManager-1-1:
>>>>>>>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>>>>>>
>>>>>>>>> Thanks : )
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Weike
>>>>>>>>>
>>>>>>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>>>>>
>>>>>>>>>> Hi community,
>>>>>>>>>>
>>>>>>>>>> Recently we have noticed a strange behavior of Flink jobs on
>>>>>>>>>> Kubernetes in per-job mode: when the parallelism increases, the
>>>>>>>>>> time it takes for the TaskManagers to register with the
>>>>>>>>>> *JobManager* becomes abnormally long (for a job with a parallelism
>>>>>>>>>> of 50, a registration attempt could take 60 ~ 120 seconds or even
>>>>>>>>>> longer), and usually more than 10 attempts are needed to finish
>>>>>>>>>> the registration.
>>>>>>>>>>
>>>>>>>>>> Because of this, we could not submit a job requiring more than 20
>>>>>>>>>> slots with the default configuration, as the TaskManager would
>>>>>>>>>> report:
>>>>>>>>>>
>>>>>>>>>>> Registration at JobManager
>>>>>>>>>>> (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>>>>>>>> attempt 9 timed out after 25600 ms
>>>>>>>>>>
>>>>>>>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0
>>>>>>>>>>> because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>>>>
>>>>>>>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>>>>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>>>>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>>>>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>>>>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>>>>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>>>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0
>>>>>>>>>>> has timed out.
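
As a side note on the numbers: the "attempt 9 timed out after 25600 ms" line
above is consistent with an exponentially growing per-attempt registration
timeout. Assuming an initial timeout of 100 ms that doubles after every
timed-out attempt and is capped by cluster.registration.max-timeout (both
values here are assumed defaults, not taken from this thread), attempt 9 is
allowed 100 * 2^8 = 25600 ms, which is also why raising
cluster.registration.max-timeout as shown below gives later attempts more
headroom. A tiny sketch of that arithmetic:

    /** Sketch of exponential registration backoff arithmetic. The 100 ms
     *  initial timeout and 30 000 ms cap are assumed defaults. */
    public class RegistrationBackoff {
        public static void main(String[] args) {
            long timeoutMs = 100;       // assumed cluster.registration.initial-timeout
            long maxTimeoutMs = 30_000; // assumed cluster.registration.max-timeout
            for (int attempt = 1; attempt <= 10; attempt++) {
                System.out.println("attempt " + attempt + ": " + timeoutMs + " ms");
                timeoutMs = Math.min(timeoutMs * 2, maxTimeoutMs);
            }
            // attempt 9 prints 25600 ms, matching the log line above
        }
    }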
>>>>>>>>>>
>>>>>>>>>> In order to cope with this issue, we have to change the
>>>>>>>>>> configuration parameters below:
>>>>>>>>>>
>>>>>>>>>>> # Prevent "Could not allocate the required slot within slot
>>>>>>>>>>> request timeout. Please make sure that the cluster has enough
>>>>>>>>>>> resources. Stopping the JobMaster for job"
>>>>>>>>>>> slot.request.timeout: 500000
>>>>>>>>>>> # Increase the max timeout of a single attempt
>>>>>>>>>>> cluster.registration.max-timeout: 300000
>>>>>>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>>>>>>> akka.ask.timeout: 10 min
>>>>>>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>>>>>>> heartbeat.timeout: 500000
>>>>>>>>>>
>>>>>>>>>> However, we acknowledge that this is only a temporary dirty fix,
>>>>>>>>>> which is not what we want. It can be seen that during TaskManager
>>>>>>>>>> registration with the JobManager, lots of warning messages appear
>>>>>>>>>> in the logs:
>>>>>>>>>>
>>>>>>>>>>> No hostname could be resolved for the IP address 9.166.0.118,
>>>>>>>>>>> using IP address as host name. Local input split assignment (such
>>>>>>>>>>> as for HDFS files) may be impacted.
>>>>>>>>>>
>>>>>>>>>> Initially we thought this was probably the cause (the reverse DNS
>>>>>>>>>> lookup might take a long time), however we later found that the
>>>>>>>>>> reverse lookup only took less than 1 ms, so it may not be because
>>>>>>>>>> of this.
>>>>>>>>>>
>>>>>>>>>> Also, we have checked the GC logs of both the TaskManagers and the
>>>>>>>>>> JobManager, and they seem to be perfectly normal, without any signs
>>>>>>>>>> of pauses. And the heartbeats are processed normally according to
>>>>>>>>>> the logs.
>>>>>>>>>>
>>>>>>>>>> Moreover, the TaskManagers register quickly with the
>>>>>>>>>> ResourceManager, but registration with the JobManager is extremely
>>>>>>>>>> slow, so this is not caused by a slow network connection.
>>>>>>>>>>
>>>>>>>>>> So we wonder: what could be the cause of the slow registration
>>>>>>>>>> between the JobManager and the TaskManager(s)? There are no other
>>>>>>>>>> warning or error messages in the logs (DEBUG level) besides the
>>>>>>>>>> "No hostname could be resolved" messages, which is quite weird.
>>>>>>>>>>
>>>>>>>>>> Thanks for reading, and we hope to get some insights into this
>>>>>>>>>> issue : )
>>>>>>>>>>
>>>>>>>>>> Sincerely,
>>>>>>>>>> Weike
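
Finally, on the "reverse lookup only took less than 1 ms" observation above:
a single measurement can easily hit a cached answer, while the occasional
uncached or slow query is what blocks the dispatcher for seconds. Here is a
small sketch, using the IP address from the log message above as a
placeholder, that times getCanonicalHostName() repeatedly and reports the
worst case; running it against the cluster's DNS setup should show whether
the high-variance lookups described later in the thread appear:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    /** Sketch: time the reverse DNS lookup several times to expose its
     *  variance. The IP below is a placeholder taken from the log above. */
    public class ReverseLookupProbe {
        public static void main(String[] args) throws UnknownHostException {
            long worstMillis = 0;
            for (int i = 0; i < 20; i++) {
                // Use a fresh InetAddress each round, since
                // getCanonicalHostName() may cache its result per instance.
                InetAddress address = InetAddress.getByName("9.166.0.118");
                long start = System.nanoTime();
                String host = address.getCanonicalHostName(); // reverse lookup
                long millis = (System.nanoTime() - start) / 1_000_000;
                worstMillis = Math.max(worstMillis, millis);
                System.out.println("lookup " + i + ": " + host + " in " + millis + " ms");
            }
            System.out.println("worst case: " + worstMillis + " ms");
        }
    }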