Hi Gil,
Could you provide the complete logs (TaskManager & JobManager) so that we
can investigate? The error itself and the behavior you're describing sound
like expected behavior when there are not enough slots available for all
the submitted jobs to be handled in time. Have you tried increasing the
slots per TaskManager (see taskmanager.numberOfTaskSlots [1]) or
increasing the number of TaskManager instances?
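
For reference, a minimal sketch of the relevant configuration, assuming a
flink-conf.yaml-based setup (the value of 4 is only illustrative; size it
based on your workload and the cores available per TaskManager):

    # flink-conf.yaml -- illustrative value, not a recommendation
    # Total slots in the cluster = numberOfTaskSlots * number of TaskManagers;
    # each submitted job needs enough free slots to leave the SCHEDULED state.
    taskmanager.numberOfTaskSlots: 4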

Best,
Matthias

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#taskmanager-numberoftaskslots

On Wed, Aug 25, 2021 at 9:52 AM Gil De Grove <gil.degr...@euranova.eu>
wrote:

> Hello,
>
> We are struggling a bit with an error in our Kubernetes deployment.
>
> The deployment is composed of 2 Flink JobManagers and 58 TaskManagers.
> When deploying the jobs, everything goes fine at first, but after the
> deployment of several jobs (a mix of batch and streaming jobs using the
> SQL Table API) we always run into the same behaviour.
>
> Several jobs are scheduled but never reach the deployed state.
> When looking at the logs, we see the attached error message. This happens
> at a random point, e.g. when deploying the 8th job or the 12th job, even
> though the jobs are always submitted in the same order.
>
> After a while, the jobs that stay in the scheduled state start to fail
> due to the error.
> If we force a ResourceManager leader election by restarting the leader,
> the jobs are rescheduled; then either other jobs end up in the same
> pending state, or sometimes all the jobs are deployed on the TaskManagers.
>
> Could you help us in the investigation of the issue?
>
