Thank you for the help. To follow up, the issue went away when we reverted
to Flink 1.13. It may be related to FLINK-27481. Before reverting, we
tested unaligned checkpoints with a timeout of 10 minutes, and the
checkpoints still timed out.
Thanks.
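
In case it helps anyone who lands on this thread later: the check Guowei
describes in the quoted reply (finding which operator holds a checkpoint
back) can also be scripted against the JSON returned by the checkpoint
details REST endpoint. Below is a rough Python sketch, not something we ran
in production. The helper name find_lagging_tasks and the sample payload are
made up for illustration, and the field names (tasks, num_subtasks,
num_acknowledged_subtasks) are what I believe the endpoint returns, so
please verify them against the REST API docs for your Flink version.

```python
def find_lagging_tasks(details):
    """Return (task_id, acked, total) for every task in a checkpoint-details
    payload that has not yet acknowledged all of its subtasks."""
    lagging = []
    for task_id, stats in details.get("tasks", {}).items():
        total = stats.get("num_subtasks", 0)
        acked = stats.get("num_acknowledged_subtasks", 0)
        if acked < total:
            lagging.append((task_id, acked, total))
    # Tasks with the smallest share of acknowledged subtasks come first.
    lagging.sort(key=lambda item: item[1] / item[2])
    return lagging


# Hypothetical payload; the task names and numbers are invented, and the
# field names are assumed, not copied from a real cluster.
sample = {
    "id": 42,
    "status": "IN_PROGRESS",
    "tasks": {
        "source": {"num_subtasks": 4, "num_acknowledged_subtasks": 4},
        "agg": {"num_subtasks": 4, "num_acknowledged_subtasks": 1},
        "sink": {"num_subtasks": 4, "num_acknowledged_subtasks": 3},
    },
}

for task_id, acked, total in find_lagging_tasks(sample):
    print(f"{task_id}: {acked}/{total} subtasks acknowledged")
```

With the invented sample this lists "agg" first and "sink" second, so "agg"
is where I would point the flame graph. A real payload would come from an
HTTP GET against the JobManager, which is left out here to keep the sketch
self-contained.
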
On Thu, Apr 28, 2022, 5:38 PM Guowei Ma wrote:
> Hi Sam
>
> I think the first step is to see which part of your Flink app is blocking
> the checkpoint from completing. Specifically, you can refer to the
> "Checkpoint Details" section of the documentation [1]. Using these methods,
> you should be able to see where the checkpoint is blocked. For example, it
> may be an aggregation operator in the app, or it may be blocked on the
> Kafka sink.
> Once you know which operator is blocking, you can use a flame graph [2] to
> see where the bottleneck in that operator is, and then address that
> specific bottleneck.
>
> Hope these help!
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/monitoring/checkpoint_monitoring/#checkpoint-details
> [2]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/flame_graphs/
>
> Best,
> Guowei
>
>
> On Fri, Apr 29, 2022 at 2:10 AM Sam Ch wrote:
>
>> Hello,
>>
>> I am running into checkpoint timeouts and am looking for guidance on
>> troubleshooting. What should I be looking at? What configuration parameters
>> would affect this? I am afraid I am a Flink newbie so I am still picking up
>> the concepts. Additional notes are below. Is there anything else I can
>> provide?
>> Thanks.
>>
>>
>> - The checkpoint size is small (less than 100 kB)
>> - Multiple Flink apps are running on the cluster; only one is running
>>   into checkpoint timeouts
>> - Timeout is set to 10 minutes
>> - Tried aligned and unaligned checkpoints
>> - Tried clearing checkpoints to start fresh
>> - Plenty of disk space
>> - Dataflow: Kafka source -> Flink app -> Kafka sink
>>
>