Hi community,

I have uploaded the log files of the JobManager and TaskManager-1-1 (one of
the 50 TaskManagers), captured with DEBUG log level and the default Flink
configuration; they clearly show that the TaskManager failed to register
with the JobManager after 10 attempts.

Here are the links:

JobManager:
https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce

TaskManager-1-1:
https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe

Thanks : )

Best regards,
Weike


On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <kyled...@connect.hku.hk> wrote:

> Hi community,
>
> Recently we have noticed strange behavior with Flink jobs on Kubernetes in
> per-job mode: as the parallelism increases, the time it takes for the
> TaskManagers to register with the *JobManager* becomes abnormally long (for
> a job with a parallelism of 50, a single registration attempt could take
> 60 ~ 120 seconds or even longer), and usually more than 10 attempts are
> needed before the registration succeeds.
>
> Because of this, we could not submit a job requiring more than 20 slots
> with the default configuration, as the TaskManagers would report:
>
>
>> Registration at JobManager
>> (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>> attempt 9 timed out after 25600 ms
>
>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because: The
>> slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>
>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>> 493cd86e389ccc8f2887e1222903b5ce).
>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has timed
>> out.
>
>
> In order to cope with this issue, we had to change the configuration
> parameters below (see the note after them on the registration timeout):
>
>>
>> # Prevent "Could not allocate the required slot within slot request
>> timeout. Please make sure that the cluster has enough resources. Stopping
>> the JobMaster for job"
>> slot.request.timeout: 500000
>
>> # Increase max timeout in a single attempt
>> cluster.registration.max-timeout: 300000
>> # Prevent "free slot (TaskSlot)"
>> akka.ask.timeout: 10 min
>> # Prevent "Heartbeat of TaskManager timed out."
>> heartbeat.timeout: 500000
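>
> (A note on why cluster.registration.max-timeout is relevant, as far as we
> understand Flink's registration behavior: the per-attempt registration
> timeout starts at cluster.registration.initial-timeout, 100 ms by default,
> and doubles after every timed-out attempt, capped at
> cluster.registration.max-timeout, 30 s by default. That would make attempt
> 9 wait 100 ms * 2^8 = 25600 ms, which matches the timeout in the log line
> above.)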
>
>
> However, we acknowledge that this is only a temporary, dirty fix, which is
> not what we want. During TaskManager registration with the JobManager, many
> warning messages show up in the logs:
>
> No hostname could be resolved for the IP address 9.166.0.118, using IP
>> address as host name. Local input split assignment (such as for HDFS files)
>> may be impacted.
>
>
> Initially we thought this was probably the cause (a reverse DNS lookup
> might take a long time); however, we later found that the reverse lookup
> took less than 1 ms, so it is probably not the reason.
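>
> As a minimal, self-contained sketch (not Flink code; the class name is
> hypothetical and the IP is the one from the warning above), this is roughly
> how such a reverse lookup can be timed:
>
> import java.net.InetAddress;
>
> public class ReverseLookupTimer {
>     public static void main(String[] args) throws Exception {
>         // A literal IP, so no forward DNS resolution happens here.
>         InetAddress addr = InetAddress.getByName("9.166.0.118");
>         long start = System.nanoTime();
>         // getCanonicalHostName() triggers the kind of reverse (PTR) lookup
>         // that the "No hostname could be resolved" warning refers to.
>         String host = addr.getCanonicalHostName();
>         long elapsedMs = (System.nanoTime() - start) / 1_000_000;
>         System.out.println("Reverse lookup -> " + host + " in " + elapsedMs + " ms");
>     }
> }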
>
> Also, we have checked the GC logs of both the TaskManagers and the
> JobManager, and they look perfectly normal, without any signs of long
> pauses. The heartbeats are also processed normally according to the logs.
>
> Moreover, the TaskManagers register quickly with the ResourceManager but
> extremely slowly with the JobManager, so this is not caused by a slow
> network connection.
>
> Here we wonder what could be the cause of the slow registration between the
> JobManager and the TaskManager(s)? There are no other warning or error
> messages in the logs (DEBUG level) besides the "No hostname could be
> resolved" messages, which is quite weird.
>
> Thanks for reading, and we hope to get some insights into this issue : )
>
> Sincerely,
> Weike
>
>
>
