Hello,

We are struggling with an error in our Kubernetes deployment.

The deployment is composed of 2 Flink JobManagers and 58 TaskManagers.
When deploying the jobs, everything goes fine at first, but after several
jobs have been deployed (a mix of batch and streaming jobs using the SQL /
Table API) we always run into the same behaviour.
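
In case it helps, our high-availability configuration is along the
following lines (a sketch: the cluster id and storage path are
placeholders, not our real values, and we use the Kubernetes HA services):

    high-availability: kubernetes                   # high-availability.type on newer Flink versions
    high-availability.storageDir: s3://some-bucket/flink/ha   # placeholder path
    kubernetes.cluster-id: our-flink-cluster        # placeholder id
    kubernetes.jobmanager.replicas: 2               # one leader + one standby JobManager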

Several jobs are scheduled but never reach the deployed state.
When looking at the logs, we see the attached error message. This happens
at a random point: it may be when deploying the 8th job or the 12th, even
though the jobs are always submitted in the same order.

After a while, the jobs that stay in the SCHEDULED state start to fail
with that error.
If we force a ResourceManager leader election by restarting the leader,
the jobs are rescheduled; then either other jobs end up in the same
pending state, or sometimes all the jobs get deployed onto the
TaskManagers.
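
Concretely, the workaround we apply is roughly the following (the pod
name and labels are placeholders for our actual ones):

    # list the JobManager pods
    kubectl get pods -l app=flink,component=jobmanager

    # delete the current leader pod to force a new leader election
    kubectl delete pod flink-jobmanager-0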

Could you help us investigate this issue?

Attachment: jobmanager-log.log
