Hi Till and community,

Increasing `kubernetes.jobmanager.cpu` in the configuration alleviates this
issue, but does not make it disappear.

After adding DEBUG logs to the internals of *flink-runtime*, we have found
that the culprit is

inetAddress.getCanonicalHostName()

in *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName*
and
*org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*,
which could take ~ 6 seconds to complete, thereby severely blocking the Akka
dispatcher(s).
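
For reference, here is a minimal standalone probe (not Flink code, just plain
Java using the standard InetAddress API) that can reproduce the slow lookup
outside of Flink; the IP address is only an example taken from an earlier log
message, so please substitute one of your TaskManager pod IPs:

import java.net.InetAddress;

public class ReverseLookupProbe {

    public static void main(String[] args) throws Exception {
        // Example TaskManager pod IP from the earlier log message; substitute your own.
        InetAddress address = InetAddress.getByName("9.166.0.118");

        long start = System.nanoTime();
        // Same call as in TaskManagerLocation#getHostName / #getFqdnHostName;
        // it triggers a blocking reverse DNS (PTR) lookup.
        String fqdn = address.getCanonicalHostName();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("Resolved to '" + fqdn + "' in " + elapsedMs + " ms");
    }
}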

After commenting out the lookups in those two methods, the issue seems to be
resolved immediately, so I wonder whether Flink could provide a configuration
parameter to turn off the reverse DNS lookup, as it seems that Flink jobs can
run happily without it.
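
Just to illustrate the idea (the class, field and option wiring below are a
hypothetical sketch, not existing Flink APIs), the lookup could be guarded by
such a flag roughly like this:

import java.net.InetAddress;

public final class HostNameSupplier {

    // Sketch only: this flag would be populated from the proposed (hypothetical)
    // configuration option; the name here is purely illustrative.
    private final boolean reverseDnsLookupEnabled;

    public HostNameSupplier(boolean reverseDnsLookupEnabled) {
        this.reverseDnsLookupEnabled = reverseDnsLookupEnabled;
    }

    public String hostNameFor(InetAddress inetAddress) {
        return reverseDnsLookupEnabled
                ? inetAddress.getCanonicalHostName()  // blocking reverse DNS (PTR) lookup
                : inetAddress.getHostAddress();       // plain IP string, no DNS round trip
    }
}

With the flag turned off, TaskManagerLocation would simply report the IP
address as the host name, which matches what already happens today whenever
the reverse lookup fails ("using IP address as host name").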

Sincerely,
Weike


On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Weike,
>
> could you try setting kubernetes.jobmanager.cpu: 4 in your
> flink-conf.yaml? I fear that a single CPU is too low for the JobManager
> component.
>
> Cheers,
> Till
>
> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Hi Weike,
>>
>> thanks for posting the logs. I will take a look at them. My suspicion
>> would be that there is some operation blocking the JobMaster's main thread
>> which causes the registrations from the TMs to time out. Maybe the logs
>> allow me to validate/falsify this suspicion.
>>
>> Cheers,
>> Till
>>
>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <kyled...@connect.hku.hk>
>> wrote:
>>
>>> Hi community,
>>>
>>> I have uploaded the log files of the JobManager and TaskManager-1-1 (one
>>> of the 50 TaskManagers) with DEBUG log level and the default Flink
>>> configuration, and they clearly show that the TaskManager failed to
>>> register with the JobManager after 10 attempts.
>>>
>>> Here is the link:
>>>
>>> JobManager:
>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>
>>> TaskManager-1-1:
>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>
>>> Thanks : )
>>>
>>> Best regards,
>>> Weike
>>>
>>>
>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <kyled...@connect.hku.hk>
>>> wrote:
>>>
>>>> Hi community,
>>>>
>>>> Recently we have noticed a strange behavior for Flink jobs in
>>>> Kubernetes per-job mode: as the parallelism increases, the time it takes
>>>> for the TaskManagers to register with the *JobManager* becomes abnormally
>>>> long (for a job with a parallelism of 50, a single registration attempt
>>>> could take 60 ~ 120 seconds or even longer), and usually more than 10
>>>> attempts are needed to complete the registration.
>>>>
>>>> Because of this, we could not submit a job requiring more than 20 slots
>>>> with the default configuration, as the TaskManager would say:
>>>>
>>>>
>>>>> Registration at JobManager 
>>>>> (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>> attempt 9 timed out after 25600 ms
>>>>
>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because:
>>>>> The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>
>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>>> timed out.
>>>>
>>>>
>>>> In order to cope with this issue, we had to change the configuration
>>>> parameters below:
>>>>
>>>>>
>>>>> # Prevent "Could not allocate the required slot within slot request
>>>>> timeout. Please make sure that the cluster has enough resources. Stopping
>>>>> the JobMaster for job"
>>>>> slot.request.timeout: 500000
>>>>
>>>> # Increase max timeout in a single attempt
>>>>> cluster.registration.max-timeout: 300000
>>>>> # Prevent "free slot (TaskSlot)"
>>>>> akka.ask.timeout: 10 min
>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>> heartbeat.timeout: 500000
>>>>
>>>>
>>>> However, we acknowledge that this is only a temporary dirty fix, which
>>>> is not what we want. During the TaskManagers' registration with the
>>>> JobManager, lots of warning messages appear in the logs:
>>>>
>>>> No hostname could be resolved for the IP address 9.166.0.118, using IP
>>>>> address as host name. Local input split assignment (such as for HDFS 
>>>>> files)
>>>>> may be impacted.
>>>>
>>>>
>>>> Initially we thought this was probably the cause (the reverse DNS lookup
>>>> might take a long time); however, we later found that the reverse lookup
>>>> took less than 1 ms, so that is probably not the reason.
>>>>
>>>> Also, we have checked the GC logs of both the TaskManagers and the
>>>> JobManager, and they seem perfectly normal, without any signs of pauses.
>>>> The heartbeats are also processed normally according to the logs.
>>>>
>>>> Moreover, the TaskManagers register quickly with the ResourceManager but
>>>> extremely slowly with the JobManager, so this is not caused by a slow
>>>> network connection.
>>>>
>>>> So we wonder: what could be causing the slow registration between the
>>>> JobManager and the TaskManager(s)? There are no other warning or error
>>>> messages in the logs (DEBUG level) besides the "No hostname could be
>>>> resolved" messages, which is quite weird.
>>>>
>>>> Thanks for reading, and we hope to get some insights into this issue : )
>>>>
>>>> Sincerely,
>>>> Weike
>>>>
>>>>
>>>>
>>>
