Hi Flink folks,
Our team has been working on a Flink service. After completing development,
we moved on to job stabilisation exercises under production load.
Under high load, when the job restarts (mostly due to
"org.apache.flink.util.FlinkExpectedException: The TaskExecutor is shutting
down"), one of the operators gets stuck in the INITIALIZING state. This
happens even when all the required capacity is available and all the TMs are
up and running. Other operators, even those with higher parallelism than
this one, initialise quickly, whilst this operator sometimes takes more
than 30 minutes to initialise.
We're operating on Flink 1.16.1.

Thank you,
Abhi
