I should clarify that this is intermittent, and when it does happen it is a
different subtask ID each time.
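
To pin down when the delays occur and how large they are, the taskmanager
logs could be scanned for the warning quoted below. This is only a minimal
sketch, not anything Flink provides: the class name CheckpointDelayScan and
the parseDelayMs helper are made up for illustration, and it assumes each
warning sits on a single line in the log file (the wrapping in the quoted
message is from the mail client).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flink or Stateful Functions.
class CheckpointDelayScan {
    // Matches the tail of the SubtaskCheckpointCoordinatorImpl warning.
    private static final Pattern DELAY =
            Pattern.compile("exceeded threshold: (\\d+)ms");

    // Returns the reported delay in milliseconds, or empty if the line
    // is not one of these warnings.
    static OptionalLong parseDelayMs(String logLine) {
        Matcher m = DELAY.matcher(logLine);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: CheckpointDelayScan <taskmanager-log>");
            return;
        }
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            OptionalLong delay = parseDelayMs(line);
            if (delay.isPresent()) {
                // Assumes the line starts with a 23-character timestamp,
                // as in "2022-05-05 13:43:14,118".
                String ts = line.substring(0, Math.min(23, line.length()));
                System.out.println(ts + " delay=" + delay.getAsLong() + "ms");
            }
        }
    }
}
```

Running it over each taskmanager log would show whether the warnings line up
with the checkpoints where one router subtask has the long start delay.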

On Thu, May 5, 2022 at 4:25 PM Ammon Diether <adiet...@gmail.com> wrote:

> Flink Stateful Functions 3.2.0  (Flink 1.14.3)
> All java embedded code.
> Parallelism 32
> Standard Stateful Functions Tasks:  router -> functions -> feedback
>
> The Router reads from Kinesis and routes to stateful functions.  For some
> reason, one and only one of the router subtasks will have a start
> delay of around 60 to 120 seconds.  All the other router subtasks
> will be at 307ms.  During the 120 seconds, all the routers stop routing
> (it looks like backpressure); after the checkpoint completes, the routers
> surge-read and catch up.
>
> I also get these warnings in some of the taskmanager logs.
>
> 2022-05-05 13:43:14,118 WARN  org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Time from receiving all checkpoint barriers/RPC to executing it exceeded threshold: 132017ms
>
> My guess at this point: one of the router subtasks falls behind, and
> although the checkpoint barrier is sent to that subtask, it takes a very
> long time for the subtask to process it.
>
> Any thoughts/insights/suggestions would be appreciated.
>
> [image: image.png]
>
