Hi Weike,

Could you try setting kubernetes.jobmanager.cpu: 4 in your flink-conf.yaml?
I suspect that a single CPU is not enough for the JobManager component.
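
For reference, the relevant entry in flink-conf.yaml would look roughly like
this (4 CPUs is just a starting point, not a definitive recommendation; the
default is 1.0):

kubernetes.jobmanager.cpu: 4   # CPUs requested for the JobManager pod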

Cheers,
Till

On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Weike,
>
> Thanks for posting the logs. I will take a look at them. My suspicion
> would be that there is some operation blocking the JobMaster's main thread,
> which causes the registrations from the TMs to time out. Maybe the logs
> will allow me to validate or falsify this suspicion.
>
> Cheers,
> Till
>
> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <kyled...@connect.hku.hk>
> wrote:
>
>> Hi community,
>>
>> I have uploaded the log files of the JobManager and TaskManager-1-1 (one of
>> the 50 TaskManagers), captured with DEBUG log level and the default Flink
>> configuration; they clearly show that the TaskManager failed to register with
>> the JobManager after 10 attempts.
>>
>> Here is the link:
>>
>> JobManager:
>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>
>> TaskManager-1-1:
>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>
>> Thanks : )
>>
>> Best regards,
>> Weike
>>
>>
>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <kyled...@connect.hku.hk>
>> wrote:
>>
>>> Hi community,
>>>
>>> Recently we have noticed a strange behavior of Flink jobs in Kubernetes
>>> per-job mode: as the parallelism increases, the time it takes for the
>>> TaskManagers to register with the *JobManager* becomes abnormally long (for
>>> a job with a parallelism of 50, a single registration attempt could take 60
>>> ~ 120 seconds or even longer), and usually more than 10 attempts are needed
>>> to complete the registration.
>>>
>>> Because of this, we could not submit a job requiring more than 20 slots
>>> with the default configuration, as the TaskManager would say:
>>>
>>>
>>>> Registration at JobManager 
>>>> (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>> attempt 9 timed out after 25600 ms
>>>
>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because:
>>>> The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>
>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>> timed out.
>>>
>>>
>>> To cope with this issue, we had to change the configuration parameters
>>> below:
>>>
>>>>
>>>> # Prevent "Could not allocate the required slot within slot request
>>>> timeout. Please make sure that the cluster has enough resources. Stopping
>>>> the JobMaster for job"
>>>> slot.request.timeout: 500000
>>>
>>> # Increase max timeout in a single attempt
>>>> cluster.registration.max-timeout: 300000
>>>> # Prevent "free slot (TaskSlot)"
>>>> akka.ask.timeout: 10 min
>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>> heartbeat.timeout: 500000
>>>
>>>
>>> However, we acknowledge that this is only a temporary, dirty fix, which is
>>> not what we want. During the TaskManagers' registration with the JobManager,
>>> lots of warning messages show up in the logs:
>>>
>>> No hostname could be resolved for the IP address 9.166.0.118, using IP
>>>> address as host name. Local input split assignment (such as for HDFS files)
>>>> may be impacted.
>>>
>>>
>>> Initially we thought this was probably the cause (the reverse DNS lookup
>>> might take a long time); however, we later found that the reverse lookup
>>> takes less than 1 ms, so this is probably not the reason.
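>>>
>>> As a rough illustration, such a lookup can be timed with a small standalone
>>> snippet like the one below (just a sketch using plain java.net.InetAddress,
>>> not the exact code we used; the IP address is taken from the warning above):
>>>
>>> import java.net.InetAddress;
>>>
>>> public class ReverseLookupTimer {
>>>     public static void main(String[] args) throws Exception {
>>>         // IP taken from the warning message; pass a different one as the first argument
>>>         String ip = args.length > 0 ? args[0] : "9.166.0.118";
>>>         InetAddress addr = InetAddress.getByName(ip); // no DNS query for an IP literal
>>>         long start = System.nanoTime();
>>>         String host = addr.getCanonicalHostName();    // this call performs the reverse DNS lookup
>>>         long elapsedMs = (System.nanoTime() - start) / 1_000_000;
>>>         System.out.println("Reverse lookup of " + ip + " -> " + host
>>>                 + " took " + elapsedMs + " ms");
>>>     }
>>> }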
>>>
>>> Also, we have checked the GC logs of both the TaskManagers and the
>>> JobManager, and they look perfectly normal, without any signs of long pauses.
>>> The heartbeats are also processed normally according to the logs.
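>>>
>>> (For reference, GC logging can be enabled through env.java.opts in
>>> flink-conf.yaml; a rough sketch for a Java 8 setup, with a hypothetical log
>>> path:
>>>
>>> env.java.opts: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/opt/flink/log/gc.log)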
>>>
>>> Moreover, the TaskManagers register quickly with the ResourceManager but
>>> extremely slowly with the JobManager, so this is not caused by a slow
>>> network connection.
>>>
>>> So we wonder: what could be the cause of the slow registration between the
>>> JobManager and the TaskManager(s)? There are no other warning or error
>>> messages in the logs (DEBUG level) apart from the "No hostname could be
>>> resolved" messages, which is quite weird.
>>>
>>> Thanks for reading, and we hope to get some insights into this issue : )
>>>
>>> Sincerely,
>>> Weike
>>>
>>>
>>>
>>
