huntercc created FLINK-23905:
--------------------------------
Summary: Reduce the load on JobManager when submitting large-scale
job with a big user jar
Key: FLINK-23905
URL: https://issues.apache.org/jira/browse/FLINK-23905
Project: Flink
Issue Type: Improvement
Components: Runtime / Task
Reporter: huntercc
As described in FLINK-20612 and FLINK-21731, there are some time-consuming
steps in the job startup phase. Recently, we found that when submitting a
large-scale job with a large user jar, the time spent on changing the status of
tasks from deploying to running accounts for a large share of the total
startup time.
In the task initialization stage, the user jar needs to be pulled from the
JobManager through the BlobService. The JobManager has to spend a lot of
computing resources distributing the files, which leads to a heavy load in the
start-up stage. More generally, under this load the JobManager fails to respond
in time to RPC requests sent by the TaskManagers, causing timeout exceptions
such as Akka timeouts, which lead to job restarts and further prolong the
start-up time of the job.
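
As a rough illustration of why the load grows with job scale, here is a
back-of-envelope sketch. It assumes that every TaskManager fetches the user jar
once, directly from the JobManager. The function name, the cluster size, the
jar size and the egress bandwidth are all hypothetical values chosen for
illustration, not measurements from our jobs.

```python
from typing import Tuple


def jobmanager_blob_load(
    num_taskmanagers: int,
    jar_size_mb: float,
    jm_egress_mb_per_s: float,
) -> Tuple[float, float]:
    """Estimate the blob traffic served by a single JobManager.

    Assumption: each TaskManager downloads the user jar exactly once,
    directly from the JobManager.

    Returns (total MB served, lower bound in seconds to serve it).
    """
    if jm_egress_mb_per_s <= 0:
        raise ValueError("jm_egress_mb_per_s must be positive")
    total_mb = num_taskmanagers * jar_size_mb
    # Lower bound only: ignores CPU cost, connection limits and retries.
    return total_mb, total_mb / jm_egress_mb_per_s


# Hypothetical cluster: 2000 TaskManagers, 200 MB jar, 1000 MB/s egress.
total_mb, seconds = jobmanager_blob_load(2000, 200.0, 1000.0)
print(f"JobManager serves {total_mb:.0f} MB, at least {seconds:.0f} s")
```

Under these assumptions the transfer volume scales linearly with the number of
TaskManagers, and the JobManager is busy serving files during the same window
in which it must also answer RPC requests.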