Hi Biao,
Thank you for your response. We have tried looking into the thread dumps of
the TaskManagers before, but that hasn't helped our case.
We see that even while all the task slots of that particular operator are
stuck in the INITIALIZING state, many of them have already started
processing new data.
Is there any other way we can approach this?
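For reference, this is roughly how a per-TaskManager thread dump can be
pulled through the JobManager REST API (a minimal sketch only; the REST
address and the TaskManager id are placeholders, not real values from our
cluster):

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;

  public class TaskManagerThreadDumpSketch {
      public static void main(String[] args) throws Exception {
          String restBase = "http://localhost:8081";  // JobManager REST address (placeholder)
          HttpClient client = HttpClient.newHttpClient();

          // List the registered TaskManagers to find the id of the one that
          // hosts the stuck subtasks.
          HttpRequest listTms = HttpRequest.newBuilder()
                  .uri(URI.create(restBase + "/taskmanagers"))
                  .GET().build();
          System.out.println(client.send(listTms, HttpResponse.BodyHandlers.ofString()).body());

          // Fetch the thread dump of that TaskManager.
          String tmId = "container_0001_01_000002";   // TaskManager id (placeholder)
          HttpRequest dump = HttpRequest.newBuilder()
                  .uri(URI.create(restBase + "/taskmanagers/" + tmId + "/thread-dump"))
                  .GET().build();
          System.out.println(client.send(dump, HttpResponse.BodyHandlers.ofString()).body());
      }
  }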

On 2024/05/06 03:54:04 Biao Geng wrote:
> Hi Abhi,
>
> If your case can be reproduced steadily, have you ever tried to get
> the thread dump of the TM in which the problematic operator resides?
> Maybe we can get more clues with the thread dump to see where the
> operator is getting stuck.
>
> Best,
> Biao Geng
>
> Abhi Sagar Khatri via user <us...@flink.apache.org> wrote on Tue, Apr 30, 2024 at 19:38:
> >
> > Some more context: our job graph has 5 different tasks/operators (Flink
> > functions), and we see this issue every time in the same particular
> > operator.
> > We're using unaligned checkpoints. With aligned checkpoints we don't see
> > this issue, but the checkpoint duration in that case is very high and
> > causes timeouts.
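> > For reference, unaligned checkpoints are enabled per job roughly like the
> > sketch below (a minimal sketch only; the interval and timeout values are
> > placeholders, not our production settings):
> >
> >   import org.apache.flink.streaming.api.environment.CheckpointConfig;
> >   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> >
> >   public class CheckpointSetupSketch {
> >       public static void main(String[] args) {
> >           StreamExecutionEnvironment env =
> >                   StreamExecutionEnvironment.getExecutionEnvironment();
> >           // Take a checkpoint every 60 seconds (placeholder interval).
> >           env.enableCheckpointing(60_000L);
> >           CheckpointConfig checkpointConfig = env.getCheckpointConfig();
> >           // Let checkpoint barriers overtake in-flight records under backpressure.
> >           checkpointConfig.enableUnalignedCheckpoints(true);
> >           // Fail a checkpoint that does not finish within 10 minutes (placeholder).
> >           checkpointConfig.setCheckpointTimeout(10 * 60 * 1000L);
> >           // Job graph definition and env.execute(...) would follow here.
> >       }
> >   }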
> >
> > On Tue, Apr 30, 2024 at 3:05 PM Abhi Sagar Khatri <a....@salesforce.com>
> > wrote:
> >>
> >> Hi Flink folks,
> >> Our team has been working on a Flink service. After completing the
> >> service development, we moved on to job stabilisation exercises at
> >> production load.
> >> During high load, we see that if the job restarts (mostly due to
> >> "org.apache.flink.util.FlinkExpectedException: The TaskExecutor is
> >> shutting down"), one of the operators gets stuck in the INITIALIZING
> >> state. This happens even when all the required capacity is present and
> >> all the TMs are up and running. Other operators with even higher
> >> parallelism than this particular operator initialize quickly, whilst
> >> this particular operator sometimes takes more than 30 minutes.
> >> We're operating on Flink 1.16.1.
> >>
> >> Thank you,
> >> Abhi
>
