Hi Flink folks,

Our team has been working on a Flink service. After completing development, we moved on to job stabilisation exercises at production load. Under high load, we see that when the job restarts (mostly due to "org.apache.flink.util.FlinkExpectedException: The TaskExecutor is shutting down"), one of the operators gets stuck in the INITIALIZING state. This happens even when all the required capacity is available and all the TMs are up and running. Other operators, including ones with higher parallelism than this operator, initialise quickly, whilst this one sometimes takes more than 30 minutes.

We're running Flink 1.16.1.

Thank you,
Abhi