[ https://issues.apache.org/jira/browse/FLINK-29545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617613#comment-17617613 ]
Piotr Nowojski edited comment on FLINK-29545 at 10/14/22 9:56 AM:
------------------------------------------------------------------

It looks like the problem is with the upstream sources. What might be happening is that the downstream subtasks are waiting for checkpoint barriers from the remaining sources; those barriers do not arrive for some reason, which blocks the alignment on the downstream subtasks and, with it, the progress of the whole job. I am not sure whether this state would be reported as backpressure. Either way, I would dig deeper into why, in the screenshot below, those 4 subtasks have not finished the checkpoint. You could, for example, post a thread dump from a task manager that is running one of those source subtasks (and tell us the name/subtask id of the problematic subtask).

Given that you have a custom source, you should also double-check that it is implemented correctly, especially the acquisition of the checkpoint lock and the {{CheckpointedFunction#snapshotState}} / {{CheckpointedFunction#initializeState}} methods. I would expect the problem to be in your implementation of that source. Could you share its source code?

!task acknowledge na.png|thumbnail!

> kafka consuming stop when trigger first checkpoint
> --------------------------------------------------
>
>                 Key: FLINK-29545
>                 URL: https://issues.apache.org/jira/browse/FLINK-29545
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Network
>    Affects Versions: 1.13.3
>            Reporter: xiaogang zhou
>            Priority: Critical
>         Attachments: backpressure 100 busy 0.png, task acknowledge na.png, task dag.png
>
> The task DAG is shown in the attached file. The job starts consuming from the earliest offset, and consumption stops when the first checkpoint is triggered.
>
> Is this normal? The sink is 0% busy, while the second operator shows 100% backpressure.
>
> Checking the checkpoint summary, some of the subtasks show n/a.
> I tried to debug this and found that in triggerCheckpointAsync, the call to triggerCheckpointAsyncInMailbox took a long time.
>
> This looks like it has something to do with logCheckpointProcessingDelay. Is there any existing fix for this issue?
>
> Can anybody help me with this?
>
> Thanks
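Since the comment above points at the checkpoint lock and the {{CheckpointedFunction}} callbacks, here is a minimal sketch of how a Flink 1.13-style legacy {{SourceFunction}} typically handles them. It is illustrative only: the class and field names ({{CountingSource}}, {{offset}}) are made up and are not taken from the reporter's job.

{code:java}
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

// Hypothetical custom source, used only to illustrate checkpoint-lock handling.
public class CountingSource extends RichSourceFunction<Long> implements CheckpointedFunction {

    private volatile boolean running = true;
    private long offset = 0L;
    private transient ListState<Long> offsetState;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (running) {
            // Emit records and advance the offset only while holding the checkpoint lock,
            // so that snapshotState() sees a consistent offset.
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(offset);
                offset++;
            }
            // Slow or blocking work (polling, I/O, backoff) belongs outside the lock,
            // otherwise checkpoints can be delayed indefinitely.
            Thread.sleep(10);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Called by the runtime under the checkpoint lock; must return quickly.
        offsetState.clear();
        offsetState.add(offset);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        offsetState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("offset", Types.LONG));
        for (Long restored : offsetState.get()) {
            offset = restored;
        }
    }
}
{code}

The point the comment is getting at: if a source holds the checkpoint lock while doing blocking work (or never releases it between records), {{snapshotState}} cannot run, the source does not emit its checkpoint barrier, and downstream subtasks can stall in barrier alignment, which matches the symptoms described above.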