[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482307#comment-17482307
 ] 

Zhu Zhu commented on FLINK-25649:
---------------------------------

I took a look at the logs. It looks to me like the problem is slow leader 
retrieval from the HA service.
Here are some events, with timestamps (in seconds), for job 
c95464255e459fce2e2677f944e72c33. The job hits a timeout on slot allocation:
1. Job scheduling started: 1642664175
2. JM is notified about RM leadership: 1642664401
3. JM registered to RM: NA
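
To make the timeline easier to reason about, here is a minimal Python sketch that uses only the two timestamps listed above (other log events may give different intervals). It converts them to UTC and computes how much of the slot request timeout would be left once the JM learns about the RM leader. The 300 s timeout is an assumption based on Flink's default `slot.request.timeout` (300000 ms); the ticket does not state the configured value, and the function names are purely illustrative.

```python
from datetime import datetime, timezone

# Timestamps (epoch seconds) copied from the list above.
SCHEDULING_STARTED = 1642664175
JM_NOTIFIED_OF_RM_LEADER = 1642664401

# Assumption: the cluster runs with Flink's default slot.request.timeout
# (300000 ms). The ticket does not state the configured value.
ASSUMED_SLOT_REQUEST_TIMEOUT_S = 300


def to_utc(epoch_seconds: int) -> str:
    """Render an epoch timestamp as an ISO-8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def notification_delay(started: int, notified: int) -> int:
    """Seconds between the start of scheduling and the leader notification."""
    return notified - started


def remaining_budget(started: int, notified: int, timeout_s: int) -> int:
    """Seconds left before the slot request times out, clamped at zero."""
    return max(0, started + timeout_s - notified)


if __name__ == "__main__":
    delay = notification_delay(SCHEDULING_STARTED, JM_NOTIFIED_OF_RM_LEADER)
    left = remaining_budget(
        SCHEDULING_STARTED,
        JM_NOTIFIED_OF_RM_LEADER,
        ASSUMED_SLOT_REQUEST_TIMEOUT_S,
    )
    print("scheduling started:", to_utc(SCHEDULING_STARTED))
    print("JM notified of RM :", to_utc(JM_NOTIFIED_OF_RM_LEADER))
    print("delay (s)         :", delay)
    print("time left (s)     :", left)
```

Under that assumed timeout, most of the budget is already spent before the JM can even start registering, and since the registration never completes (step 3 is NA), the slot request cannot be fulfilled in time.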

This shows the following problems:
1. It takes a very long time (174 s) for the job manager to be notified about 
the leadership of the RM.
2. The job manager tries to register with the resource manager but cannot. 
The registration times out because the RM cannot retrieve the leader of the 
job from the HA service.

Both problems are related to the HA service, so I think you should check the 
status of the k8s HA service.




> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25649
>                 URL: https://issues.apache.org/jira/browse/FLINK-25649
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.1
>            Reporter: Gil De Grove
>            Priority: Major
>         Attachments: flink_scheduler_deadlock.json.zip
>
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our Flink cluster.
> The symptoms are that some, most, or all of our tasks (it varies; the 
> symptoms are not always the same) are shown as _SCHEDULED_ but fail after a 
> timeout. The jobs are then shown as _RUNNING_.
> The exception is the following:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not many logs 
> for that part of the code) that the failure is due to a deadlock/race 
> condition that happens when several jobs are submitted at the same time to 
> the Flink cluster, even though we have enough slots available in the 
> cluster.
> We actually get the error with 52 available task slots and 12 jobs that are 
> not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 job managers using k8s HA mode (resource requests: 2 CPUs, 4 GB RAM; 
> memory limit set to 4 GB)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2 GB RAM; 
> no limits set)
>  * Our Flink cluster is shut down every night and restarted every morning. 
> The error seems to occur when a lot of jobs need to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are scheduled and run correctly; it really seems to be a scheduling 
> issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)