Hi,

Thanks for the response. Yes, we are running Flink in HA mode, and we checked
that there are no such quota limits on GCS for us. Please find the logs
below; you can see that copying the blob started at 11:50:39,455 and the
JobGraph submission was received at 11:50:46,400.

2020-09-01 11:50:37,061 DEBUG
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  -
Release TaskExecutor 2e20ee286a3fee5831fefc0ab427ba92 because it exceeded
the idle timeout.
2020-09-01 11:50:37,061 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Worker 4e4ae8b90f911787ac112c2847759512 could not be stopped.
2020-09-01 11:50:37,062 DEBUG
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  -
Release TaskExecutor 032a62eff2a7d8067f25b8fc943f262f because it exceeded
the idle timeout.
2020-09-01 11:50:37,062 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Worker 2532ed667a2566e06c3d9e3bc85c6ed6 could not be stopped.
2020-09-01 11:50:37,305 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Trigger heartbeat request.
2020-09-01 11:50:37,305 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Trigger heartbeat request.
2020-09-01 11:50:37,354 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Received heartbeat from 4e4ae8b90f911787ac112c2847759512.
2020-09-01 11:50:37,354 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Received heartbeat from 2532ed667a2566e06c3d9e3bc85c6ed6.
2020-09-01 11:50:39,455 DEBUG
org.apache.flink.runtime.blob.FileSystemBlobStore             - Copying
from
/tmp/flink-blobs/blobStore-6e468470-ba5f-4ea0-a8fd-cf31af663f11/job_980d3ff229b7fbfe889e2bc93e526da0/blob_p-90776d1fe82af438f6fe2c4385461fe6cb96d25a-86f63972fbf1e10b502f1640fe01b426
to
gs://<test-bucket>/blob/job_980d3ff229b7fbfe889e2bc93e526da0/blob_p-90776d1fe82af438f6fe2c4385461fe6cb96d25a-86f63972fbf1e10b502f1640fe01b426.
2020-09-01 11:50:43,904 DEBUG
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Got
ping response for sessionid: 0x30be3d929102460 after 2ms
2020-09-01 11:50:46,400 INFO
 org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Received
JobGraph submission 980d3ff229b7fbfe889e2bc93e526da0 (cli-test-001).
2020-09-01 11:50:46,403 INFO
 org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Submitting
job 980d3ff229b7fbfe889e2bc93e526da0 (cli-test-001).
2020-09-01 11:50:46,405 DEBUG
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
Adding job graph 980d3ff229b7fbfe889e2bc93e526da0 to
flink/cluster/jobgraphs/980d3ff229b7fbfe889e2bc93e526da0.
2020-09-01 11:50:47,325 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Trigger heartbeat request.
2020-09-01 11:50:47,325 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Trigger heartbeat request.
2020-09-01 11:50:47,325 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Trigger heartbeat request.
2020-09-01 11:50:47,325 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Trigger heartbeat request.
2020-09-01 11:50:47,330 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Received heartbeat from 4e4ae8b90f911787ac112c2847759512.
2020-09-01 11:50:47,331 DEBUG
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
Received heartbeat from 2532ed667a2566e06c3d9e3bc85c6ed6.
2020-09-01 11:50:52,880 DEBUG
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Got
notification sessionid:0x30be3d929102460
2020-09-01 11:50:52,880 DEBUG
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Got
WatchedEvent state:SyncConnected type:NodeChildrenChanged
path:/flink/cluster/jobgraphs for sessionid 0x30be3d929102460
2020-09-01 11:50:52,882 INFO
 org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
Added SubmittedJobGraph(980d3ff229b7fbfe889e2bc93e526da0) to ZooKeeper.
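
To help isolate whether the filesystem itself is slow, one option is to time
a raw copy of a jar-sized file outside of Flink. A self-contained sketch is
below; it uses a local destination as a stand-in, and for the real remote
path you would replace `cp` with `gsutil cp` against the bucket (the gs://
destination would be a placeholder for our actual bucket):

```shell
# Rough check of blob-store write latency, independent of Flink.
# Creates a file about the size of our job jar (~200 MB) and times the copy.
# DEST is a local directory here so the sketch is self-contained; to measure
# the real remote path, point the copy at the bucket with `gsutil cp`
# instead, e.g. gsutil cp "$SRC" gs://<test-bucket>/tmp/blob-test
SRC="$(mktemp)"
DEST="$(mktemp -d)"
dd if=/dev/zero of="$SRC" bs=1M count=200 2>/dev/null
time cp "$SRC" "$DEST/blob-test"
```

If the timed remote copy of a 200 MB file is itself close to 10 seconds,
the delay is bandwidth/throttling rather than anything Flink does.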

Thank You.

On Wed, Sep 2, 2020 at 2:06 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Prakhar,
>
> have you enabled HA for your cluster? If yes, then Flink will try to store
> the job graph in the configured high-availability.storageDir in order to be
> able to recover it. If this operation takes a long time, then either the
> filesystem is slow or storing the pointer in ZooKeeper is. If it is the
> filesystem, I would suggest checking whether you have some read/write
> quotas that might slow the operation down.
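>
> For reference, a minimal sketch of the settings involved (bucket and
> quorum are placeholders, not your actual values):
>
> ```
> high-availability: zookeeper
> high-availability.storageDir: gs://<bucket>/flink/ha
> high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
> ```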
>
> If you haven't enabled HA, or persisting the job graph is not what takes
> long, then the next most likely candidate is the recovery from a previous
> checkpoint. Here again, Flink needs to read from the remote storage (in
> your case GCS). Depending on the size of the checkpoint and the read
> bandwidth, this can be faster or slower. The best way to figure out where
> the time goes is to share the logs with us so that we can confirm it.
>
> To sum it up, the job submission is most likely slow because of the
> interplay of Flink with the external system (most likely your configured
> filesystem). If the filesystem is somewhat throttled, then Flink cannot do
> much about it.
>
> You could also check whether your jar contains dependencies that are not
> needed (e.g. Flink dependencies, which are usually provided by the
> system). That way you could decrease the size of the jar a bit.
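>
> For example, in a Maven build, marking Flink dependencies as provided
> keeps them out of the fat jar (the artifact ID and version here are
> illustrative; match them to your actual dependencies and Scala version):
>
> ```
> <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-streaming-java_2.11</artifactId>
>     <version>1.9.0</version>
>     <scope>provided</scope>
> </dependency>
> ```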
>
> Cheers,
> Till
>
> On Wed, Sep 2, 2020 at 9:48 AM Prakhar Mathur <prakha...@gojek.com> wrote:
>
>> Hi,
>>
>> We are currently running Flink 1.9.0. We see a delay of around 20 seconds
>> when starting a job on a session Flink cluster. We start the job using
>> Flink's monitoring REST API, with our jar already uploaded to the Job
>> Manager. The jar file is around 200 MB. We are using the memory state
>> backend with GCS as remote storage.
>>
>> Running the cluster in debug mode, we observed that generating the plan
>> itself takes around 6 seconds and copying the job graph from the local to
>> the remote folder takes around 10 seconds.
>>
>> We were wondering whether this delay is expected, or whether it can be
>> reduced by tweaking any configuration?
>>
>> Thank you. Regards
>> Prakhar Mathur
>>
>
