Kartikey Pant created FLINK-35319:
-------------------------------------
Summary: Specific Operator getting stuck in Initialization for a
long time
Key: FLINK-35319
URL: https://issues.apache.org/jira/browse/FLINK-35319
Project: Flink
Issue Type: Bug
Affects Versions: 1.16.1
Reporter: Kartikey Pant
Attachments: Screenshot 2024-05-09 at 2.19.10 PM.png, image.png
Our team has been working on a Flink service. After completing development,
we moved on to job stabilisation exercises under production load.
Under high load, when the job restarts (mostly due to
"org.apache.flink.util.FlinkExpectedException: The TaskExecutor is shutting
down"), one of the operators gets stuck in the INITIALIZING state.
This happens even when all the required capacity is available and all the
TaskManagers (TMs) are up and running. Other operators, including some with
higher parallelism than this operator, initialise quickly, whilst this
operator sometimes takes more than 30 minutes.