Hi Weike,

could you try setting kubernetes.jobmanager.cpu: 4 in your flink-conf.yaml? I fear that a single CPU is too low for the JobManager component.
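For reference, that is a one-line change in flink-conf.yaml; the value 4 below is just a starting point, not a hard requirement:

    # flink-conf.yaml
    # CPU cores requested for the JobManager pod (the default is 1.0)
    kubernetes.jobmanager.cpu: 4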
Cheers,
Till

On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Weike,
>
> thanks for posting the logs. I will take a look at them. My suspicion
> would be that there is some operation blocking the JobMaster's main
> thread which causes the registrations from the TMs to time out. Maybe
> the logs will allow me to validate/falsify this suspicion.
>
> Cheers,
> Till
>
> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>
>> Hi community,
>>
>> I have uploaded the log files of the JobManager and of TaskManager-1-1
>> (one of the 50 TaskManagers), captured with DEBUG log level and the
>> default Flink configuration. They clearly show that the TaskManager
>> failed to register with the JobManager after 10 attempts.
>>
>> Here are the links:
>>
>> JobManager:
>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>
>> TaskManager-1-1:
>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>
>> Thanks : )
>>
>> Best regards,
>> Weike
>>
>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>
>>> Hi community,
>>>
>>> Recently we have noticed strange behavior in Flink jobs running on
>>> Kubernetes in per-job mode: as the parallelism increases, the time it
>>> takes for the TaskManagers to register with the *JobManager* becomes
>>> abnormally long (for a job with a parallelism of 50, a single
>>> registration attempt could take 60 ~ 120 seconds or even longer), and
>>> usually more than 10 attempts are needed to complete the registration.
>>>
>>> Because of this, we could not submit a job requiring more than 20
>>> slots with the default configuration, as the TaskManager would report:
>>>
>>>> Registration at JobManager
>>>> (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>> attempt 9 timed out after 25600 ms
>>>
>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because:
>>>> The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>
>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>> timed out.
>>>
>>> To cope with this issue, we had to change the configuration
>>> parameters below:
>>>
>>>> # Prevent "Could not allocate the required slot within slot request
>>>> # timeout. Please make sure that the cluster has enough resources.
>>>> # Stopping the JobMaster for job"
>>>> slot.request.timeout: 500000
>>>> # Increase the max timeout of a single registration attempt
>>>> cluster.registration.max-timeout: 300000
>>>> # Prevent "Free slot TaskSlot(...)"
>>>> akka.ask.timeout: 10 min
>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>> heartbeat.timeout: 500000
>>>
>>> However, we acknowledge that this is only a temporary dirty fix, which
>>> is not what we want. During the TaskManagers' registration with the
>>> JobManager, many warning messages like the following appear in the
>>> logs:
>>>
>>>> No hostname could be resolved for the IP address 9.166.0.118, using
>>>> IP address as host name. Local input split assignment (such as for
>>>> HDFS files) may be impacted.
>>> Initially we thought this was probably the cause (a reverse DNS
>>> lookup might take a long time), but we later found that the reverse
>>> lookup took less than 1 ms, so this is probably not the reason.
>>>
>>> We have also checked the GC logs of both the TaskManagers and the
>>> JobManager, and they look perfectly normal, without any signs of long
>>> pauses. The heartbeats are processed normally according to the logs
>>> as well.
>>>
>>> Moreover, the TaskManagers register quickly with the ResourceManager
>>> but then extremely slowly with the JobManager, so a slow network
>>> connection does not seem to be the cause either.
>>>
>>> So we wonder: what could be causing the slow registration between the
>>> JobManager and the TaskManager(s)? There are no other warning or
>>> error messages in the logs (DEBUG level) apart from the "No hostname
>>> could be resolved" messages, which is quite weird.
>>>
>>> Thanks for reading, and we hope to get some insights into this
>>> issue : )
>>>
>>> Sincerely,
>>> Weike
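For anyone who wants to reproduce Weike's reverse-lookup timing, a minimal, self-contained sketch is below. This is not code from the thread: the IP address is simply the one from the warning message quoted above, and the program times a single reverse (PTR) lookup through the standard java.net.InetAddress API.

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class ReverseLookupTimer {
        public static void main(String[] args) throws UnknownHostException {
            // Parsing a literal IP does not hit DNS; only the reverse
            // lookup below does.
            InetAddress addr = InetAddress.getByName("9.166.0.118");

            long start = System.nanoTime();
            // getCanonicalHostName() performs the reverse lookup; if no PTR
            // record exists it falls back to the textual IP address, which
            // is what the "No hostname could be resolved" warning reports.
            String host = addr.getCanonicalHostName();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            System.out.println("Reverse lookup of " + addr.getHostAddress()
                    + " -> '" + host + "' took " + elapsedMs + " ms");
        }
    }

If this consistently returns in about a millisecond, as Weike observed, the DNS path can indeed be ruled out.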