Hi,
Did you check the TaskManager logs for retries by the s3a file
system during checkpointing?

I'm not aware of any metrics in Flink that could be helpful in this
situation.
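
A minimal sketch of how the log check could be scripted, in case it is
useful. The keywords below are assumptions, not exact messages emitted by
the s3a file system or the AWS SDK (those depend on the versions in use),
and the function name is hypothetical; adjust both to what your logs
actually contain.

```python
import sys
from collections import Counter

# Assumed substrings that suggest S3 trouble; not exact log messages.
S3_HINTS = ("s3a", "amazonaws")
RETRY_HINTS = ("retr", "throttl", "timeout")


def count_s3_retry_lines(lines):
    """Count log lines that mention an S3-related name and a retry-like keyword."""
    hits = Counter()
    for line in lines:
        low = line.lower()
        if not any(hint in low for hint in S3_HINTS):
            continue
        for keyword in RETRY_HINTS:
            if keyword in low:
                hits[keyword] += 1
                break  # count each line at most once
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_tm_log.py <path to a TaskManager log file>
    with open(sys.argv[1], errors="replace") as log_file:
        for keyword, count in count_s3_retry_lines(log_file).most_common():
            print(keyword, count)
```

If the counts are non-zero, comparing the timestamps of the matching lines
with the slow checkpoints would show whether the two coincide.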

Best,
Robert

On Tue, Apr 14, 2020 at 12:02 AM Lu Niu <qqib...@gmail.com> wrote:

> Hi, Flink users
>
> We notice that async checkpointing can sometimes be extremely slow, leading
> to checkpoint timeouts. For example, for a state size of around 2.5MB, the
> async part can take 7~12 min:
>
> [image: Screen Shot 2020-04-09 at 5.04.30 PM.png]
>
> Notice that all the slowness comes from the async part; there is no delay
> in the sync part or in barrier alignment. As we use RocksDB incremental
> checkpointing, I suspect the slowness is caused by uploading files to S3.
> However, I am not completely sure, since there are other steps in async
> checkpointing. Does Flink expose fine-grained metrics to debug such slowness?
>
> setup: Flink 1.9.1, RocksDB incremental state backend, S3AHadoopFileSystem
>
> Best
> Lu
>
