[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2024-04-02 Thread Hunter Medney (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833077#comment-17833077
 ] 

Hunter Medney commented on FLINK-25649:
---

I was seeing this issue consistently, and [Yang's config 
change|https://issues.apache.org/jira/browse/FLINK-25649?focusedCommentId=17482881=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17482881]
 worked for me as well on Flink 1.15.4 and k8s operator 1.7.0. Thank you!

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
> Attachments: flink_scheduler_deadlock.json.zip
>
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-02-09 Thread Gil De Grove (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489506#comment-17489506
 ] 

Gil De Grove commented on FLINK-25649:
--

Hello, 

We tested by increasing the value for the k8s HA, and the bug does not seem to 
happen anymore. 

Do we know if a more future-proof fix is planned for this issue?

 

Thanks again for all the help on this matter!

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
> Attachments: flink_scheduler_deadlock.json.zip
>
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-26 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482881#comment-17482881
 ] 

Yang Wang commented on FLINK-25649:
---

This might be related with FLINK-22006. When the K8s HA is enabled, the max 
jobs in a same session cluster is limited by the max concurrent request(default 
is 64). It also means that we could not run more than 20 jobs in a shared 
session cluster.

 

Could you please add the following config option to your flink-conf.yaml and 
have a try again?
{code:java}
env.java.opts.jobmanager: "-Dkubernetes.max.concurrent.requests=1000"{code}

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
> Attachments: flink_scheduler_deadlock.json.zip
>
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-26 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482848#comment-17482848
 ] 

Zhu Zhu commented on FLINK-25649:
-

Yes, your Flink session uses kubernetes to serve as its HA service, and the HA 
service seems to be slow/stuck.

[~yangwang166] would you help to take a look?

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
> Attachments: flink_scheduler_deadlock.json.zip
>
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-26 Thread Gil De Grove (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482589#comment-17482589
 ] 

Gil De Grove commented on FLINK-25649:
--

[~zhuzh] 

I'm not sure if I get everything on your explanation. But from what I get, you 
mean that the actual root cause of the job not being able to be scheduled is 
the HA service in the Job manager.


How do you propose to proceed on checking this?

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
> Attachments: flink_scheduler_deadlock.json.zip
>
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-25 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17482307#comment-17482307
 ] 

Zhu Zhu commented on FLINK-25649:
-

I took a look at the logs. Looks to me the problem is the slow leader retrieval 
from HA.
Here are some actions with timestamps(in second) of job 
c95464255e459fce2e2677f944e72c33. The job encounters timeout on slot allocation:
1. job scheduling started : 1642664175
2. JM is notified about RM leadership : 1642664401
3. JM registered to RM: NA

It shows problems below:
1. It takes a very long time(174 s) for the job manager to be notified about 
the leadership of RM.
2. The job manager is trying to but cannot register to the resource manager. 
The registration timeouts because the RM cannot retrieve the leadership of the 
job from HA service.

Both problems are related to the HA service. So I think you can check the 
status of the k8s HA service.




> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
> Attachments: flink_scheduler_deadlock.json.zip
>
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-23 Thread Gil De Grove (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480881#comment-17480881
 ] 

Gil De Grove commented on FLINK-25649:
--

Hello [~trohrmann], [~zhuzh]

Thanks for the reply, due to the answer, I decided to went for a full 
anonymization, we removed all data that felt confidential, and replaced then 
with a tag, if you feel that those logs contains important information, we can 
get the, from the raw source.

To note;
- we removed the apache.wire, apache.http.headers and aws.requests and aws 
S3client that wre quite vorbose from the logs to try to keep only the one 
interesting for the issue at hand. 
- The ID of the jobs are still present in the logs. 
- We are using json formated logs, I kep them in that format in the logs.

 

 

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
> Attachments: flink_scheduler_deadlock.json.zip
>
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-19 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479064#comment-17479064
 ] 

Zhu Zhu commented on FLINK-25649:
-

You can also send the logs to me (reed...@gmail.com).

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-19 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478527#comment-17478527
 ] 

Till Rohrmann commented on FLINK-25649:
---

You can share it with me (trohrm...@apache.org) but I am not sure how fast I 
get to analyse it. Sharing it with the community would have the benefit that we 
probably find the root cause faster.

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-19 Thread Gil De Grove (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478449#comment-17478449
 ] 

Gil De Grove commented on FLINK-25649:
--

[~zhuzh] [~trohrmann], 

 

The logs are quite heavy, and contains some confidential information.

Would it be possible to only share them with you two, or to a specific email 
address to avoid any leak in the apache Jira ? If not, we will remove the log 
that may contain the confidential informations

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-19 Thread Gil De Grove (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478436#comment-17478436
 ] 

Gil De Grove commented on FLINK-25649:
--

[~zhuzh] [~trohrmann] ,

I was able to reproduce the issue this morning, I'm gathering the logs and will 
clean them then upload it there, the logs will contain the logs for both the 
job-manager of our HA deployment. 

This morning, I actually only have DataStream (full stream) jobs that run, 3 of 
them cannot be scheduled, they all have a parallelism of 3. 

The job-manager reports 30 task slots available. 

>> One question I have is that when you say "100% of the jobs working", do you 
>> mean "All tasks of all jobs are in RUNNING state at the same time", or "All 
>> bounded jobs can finish and all unbounded jobs' tasks are in RUNING state"?

I mean; all the tasks of all the jobs, bounded or unbounded, are scheduled at 
the same time on the cluster.

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-18 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478365#comment-17478365
 ] 

Zhu Zhu commented on FLINK-25649:
-

>> Would it make sens to say that we encounter this issue because 2 jobs are 
>> waiting for the same slots to be available when slots are available in the 
>> cluster?
Probably not. This is not the designed behavior of slot allocation of Flink.

One question I have is that when you say "100% of the jobs working", do you 
mean "All tasks of all jobs are in RUNNING state at the same time", or "All 
bounded jobs can finish and all unbounded jobs' tasks are in RUNING state"?

And I second Till that the debug logs is very important to find the root cause. 
It would be very helpful if you can share it.



> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-18 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477705#comment-17477705
 ] 

Till Rohrmann commented on FLINK-25649:
---

Do you have by chance the logs of the deadlocked run [~gdegrove]? I think they 
could be quite helpful for finding the root cause of the problem. If the 
problem is reproducible, then {{DEBUG}} logs would be awesome.

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-17 Thread Gil De Grove (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477294#comment-17477294
 ] 

Gil De Grove commented on FLINK-25649:
--

[~trohrmann], we are not configuring anything related to sharing group, Flink 
jobs in the cluster, without anything special about resources.

[~zhuzh], I'm not sure I get it, the cluster has enough resource as the same 
set of resources can be executed in the cluster with exactly the same 
configuration. From what we have seen this issue is non deterministic, as with 
the same set of resource and jobs, we may have a random number of job set as 
scheduled. Sometimes we have 100% of the jobs working, sometimes we have only 
25% of them running, sometimes it's 90%.

This is what makes us think that the issue is not related in the available 
resources available, but more in the way the cluster checks for the available 
resources.

Would it make sens to say that we encounter this issue because 2 jobs are 
waiting for the same slots to be available when slots are available in the 
cluster?

That could be explained if Flink tries to collocate all the slots of the same 
job to the same task manager(or same set of taskmanagers). That would be 
explained if the scheduler schedule all the jobs at the same time, instead of 
picking them one by one from a queue. 

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-17 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477072#comment-17477072
 ] 

Zhu Zhu commented on FLINK-25649:
-

>From your description, it's possible that your cluster does not have enough 
>resources to run all job *at the same time*.
There is a known issue of session cluster resource deadlocks if it runs 
multiple bounded streaming jobs(or batch jobs with PIPELINED edges).
For example, there is a cluster have 30 slots. We want to run 2 bounded 
streaming jobs on it, each job needs 20 slots. 
- If the 2 jobs are submitted at the same time, there is *chance* that the 2 
jobs each acquired 15 slots and neither can run. The slots will be kept by the 
SlotPool of each job. The slots cannot be returned to the ResourceManager 
because they will be requested by the job again before reaching the idle 
timeout.
- If the 2 jobs are submitted with an interval, the earlier job may acquire 20 
slots and start running. The latter submitted one can only acquire 10 slots and 
have to wait, but after the earlier job finishes, the slots will be returned 
and the latter can start running.

If this is your case, you can try setting config `slot.idle.timeout` to `0`, 
and maybe increase config `restart-strategy.fixed-delay.delay` a bit. So that 
when a job failed to acquire enough slots within timeout, it will soon return 
all the slots to the ResourceManager and other jobs will be able get the slots 
to run. This may mitigate your problem, although slot requests timeout may 
still happen at times.

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-17 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477060#comment-17477060
 ] 

Till Rohrmann commented on FLINK-25649:
---

Are you configuring non-default slot sharing groups [~gdegrove]?

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-17 Thread Gil De Grove (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477031#comment-17477031
 ] 

Gil De Grove commented on FLINK-25649:
--

Hello [~zhuzh] , 

To add to this issue:

1) our jobs are a mix of bounded and unbounded DataStream, we have jobs that 
finished and other that does not. They all have a //ism different from 1 to 20. 
The unbounded DataStream are in fast using the Flink Table API.
2) The cluster does have enough as sometimes we can have all of the jobs 
scheduled. From what we understand, we only need to have an amount of task 
slots available equal or superior to the sum of the max parallelism of all our 
task in our jobs.
For example, one of our unbounded  big job contains 360 tasks with a //ism of 
20, so it means that this jobs should only take 20 task slots.

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-16 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476945#comment-17476945
 ] 

Zhu Zhu commented on FLINK-25649:
-

Hi Gil, just to confirm the scenario, 
1. are all the jobs unbounded streaming one(i.e. jobs will not finish and 
cannot return slots to run other jobs)?
2. if so, does the cluster have enough resources for all the jobs to run at the 
same time? 

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-13 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17475999#comment-17475999
 ] 

Till Rohrmann commented on FLINK-25649:
---

cc [~dmvk]

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25649) Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

2022-01-13 Thread Gil De Grove (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17475386#comment-17475386
 ] 

Gil De Grove commented on FLINK-25649:
--

I'll try to upload Debug llogs as soon as possible, they first need to be 
skimmed of possible confidential informations

> Scheduling jobs fails with 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> -
>
> Key: FLINK-25649
> URL: https://issues.apache.org/jira/browse/FLINK-25649
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.1
>Reporter: Gil De Grove
>Priority: Major
>
> Following comment from Till on this [SO 
> question|https://stackoverflow.com/questions/70683048/scheduling-jobs-fails-with-org-apache-flink-runtime-jobmanager-scheduler-noresou?noredirect=1#comment124980546_70683048]
> h2. *Summary*
> We are currently experiencing a scheduling issue with our flink cluster.
> The symptoms are that some/most/all (it depend, the symptoms are not always 
> the same) of our tasks are showed as _SCHEDULED_ but fail after a timeout. 
> The jobs are them showed a _RUNNING_
> The failing exception is the following one:
> {{Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout}}
> After analysis, we assume (we cannot prove it, as there are not that much 
> logs for that part of the code) that the failure is due to a deadlock/race 
> condition that is happening when several jobs are being submitted at the same 
> time to the flink cluster, even though we have enough slots available in the 
> cluster.
> We actually have the error with 52 available task slots, and have 12 jobs 
> that are not scheduled.
> h2. Additional information
>  * Flink version: 1.13.1 commit a7f3192
>  * Flink cluster in session mode
>  * 2 Job managers using k8s HA mode (resource requests: 2 CPU, 4Gb Ram, 
> limits sets on memory to 4Gb)
>  * 50 task managers with 2 slots each (resource requests: 2 CPUs, 2GB Ram. No 
> limits set).
>  * Our Flink cluster is shut down every night, and restarted every morning. 
> The error seems to occur when a lot of jobs needs to be scheduled. The jobs 
> are configured to restore their state, and we do not see any issues for jobs 
> that are being scheduled and run correctly, it seems to really be related to 
> a scheduling issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)