[ https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482881#comment-17482881 ]
Yang Wang commented on FLINK-25649:
-----------------------------------

This might be related to FLINK-22006. When K8s HA is enabled, the maximum number of jobs in the same session cluster is limited by the maximum number of concurrent Kubernetes client requests (default: 64). In practice this means we could not run more than 20 jobs in a shared session cluster. Could you please add the following config option to your flink-conf.yaml and try again?

{code:java}
env.java.opts.jobmanager: "-Dkubernetes.max.concurrent.requests=1000"{code}

> Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25649
>                 URL: https://issues.apache.org/jira/browse/FLINK-25649
>             Project: Flink
>          Issue Type: Bug
>      Components: Runtime / Coordination
>    Affects Versions: 1.13.1
>            Reporter: Gil De Grove
>            Priority: Major
>        Attachments: flink_scheduler_deadlock.json.zip
>
>
> Following a comment from Till on this [SO question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
>
> h2. *Summary*
>
> We are currently experiencing a scheduling issue with our Flink cluster.
> The symptoms are that some, most, or all of our tasks (it depends; the symptoms are not always the same) are shown as _SCHEDULED_ but fail after a timeout. The jobs are then shown as _RUNNING_.
> The failing exception is the following:
> {{Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not many logs for that part of the code) that the failure is due to a deadlock/race condition that occurs when several jobs are submitted to the Flink cluster at the same time, even though we have enough slots available in the cluster.
> We actually hit the error with 52 available task slots and 12 jobs that are not scheduled.
>
> h2. Additional information
>
> * Flink version: 1.13.1, commit a7f3192
> * Flink cluster in session mode
> * 2 JobManagers using K8s HA mode (resource requests: 2 CPUs, 4 GB RAM; limits set on memory to 4 GB)
> * 50 TaskManagers with 2 slots each (resource requests: 2 CPUs, 2 GB RAM; no limits set)
> * Our Flink cluster is shut down every night and restarted every morning.
>
> The error seems to occur when many jobs need to be scheduled. The jobs are configured to restore their state, and we do not see any issues for the jobs that are scheduled and run correctly; it really seems to be related to a scheduling issue.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
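To confirm that slots really are free while jobs sit in _SCHEDULED_, the JobManager's REST endpoint {{GET /overview}} reports cluster-wide slot counts. Below is a minimal sketch that summarizes such a payload; the endpoint and the {{slots-total}}/{{slots-available}}/{{jobs-running}} field names are from Flink's REST API, while the sample numbers are only illustrative of the cluster described above (50 TaskManagers x 2 slots, 52 slots free):

```python
import json


def summarize_overview(overview: dict) -> dict:
    """Summarize the JSON returned by Flink's REST endpoint GET /overview.

    The field names ("slots-total", "slots-available", "jobs-running")
    are part of Flink's REST API cluster overview response.
    """
    total = overview["slots-total"]
    available = overview["slots-available"]
    return {
        "slots_total": total,
        "slots_available": available,
        "slots_in_use": total - available,
        "jobs_running": overview.get("jobs-running", 0),
    }


if __name__ == "__main__":
    # In a real check you would fetch the JSON from the JobManager, e.g.:
    #   curl http://<jobmanager-host>:8081/overview
    # Here we parse a sample payload shaped like that response.
    sample = json.loads(
        '{"slots-total": 100, "slots-available": 52, "jobs-running": 38}'
    )
    print(summarize_overview(sample))
```

If {{slots_available}} is comfortably above the parallelism of the stuck jobs, the {{NoResourceAvailableException}} points at a scheduling/allocation problem rather than genuine slot exhaustion, which matches the behavior reported here.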