Hi,

Yes, for example [1]. Most of the points that you mentioned are already visible in the UI and/or via metrics; just take a look at the subtask checkpoint stats.

> when barriers are injected at the source by the checkpoint coordinator

That's the checkpoint trigger time.

> when each downstream task observes the first barrier of a checkpoint

In Flink < 1.11 this is implicitly visible by subtracting the sync, async and alignment times from the end-to-end checkpoint time. In Flink 1.11+ [2] there is/will be an explicit new metric "checkpointStartDelayNanos" (visible in the UI as "Start Delay") for that.

> when all barriers of a checkpoint arrive at a task

Yes, that's the alignment time.
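As a back-of-the-envelope illustration of that subtraction for Flink < 1.11 (a minimal sketch; the literals are placeholders for the per-subtask values you would read from the checkpoint stats in the UI or REST API):

public class StartDelayEstimate {
    public static void main(String[] args) {
        // Per-subtask checkpoint durations in milliseconds. The literals are
        // placeholders; the real values come from the subtask checkpoint stats.
        long endToEnd  = 420_000; // end-to-end checkpoint duration
        long sync      = 1_200;   // synchronous part of the snapshot
        long async     = 15_000;  // asynchronous part of the snapshot
        long alignment = 3_500;   // barrier alignment time

        // On Flink < 1.11 the start delay is only available implicitly:
        // whatever remains of the end-to-end time is (roughly) how long the
        // first barrier took to reach the subtask.
        long startDelay = endToEnd - sync - async - alignment;
        System.out.println("approx. start delay: " + startDelay + " ms");
    }
}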
> when a snapshot starts/completes

Yes:

Start - that's start delay + alignment time, as the checkpoint starts immediately after the alignment is completed.
Complete - that's the end-to-end duration.

> when the ack is sent to the checkpoint coordinator

That's the end-to-end duration.

One thing which is missing is:

> when the upload to the remote file system starts/completes

as currently that's just part of the async time. I've created a ticket to track this work [3], so let's move the discussion about it there.

Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/checkpoint_monitoring.html
[2] https://issues.apache.org/jira/browse/FLINK-15603
[3] https://issues.apache.org/jira/browse/FLINK-17468

> On 25 Apr 2020, at 18:32, Chen Q <qinnc...@gmail.com> wrote:
>
> Just to echo what Lu mentioned, is there documentation where we can find more info on:
>
> when barriers are injected at the source by the checkpoint coordinator
> when each downstream task observes the first barrier of a checkpoint
> when all barriers of a checkpoint arrive at a task
> when a snapshot starts/completes
> when the upload to the remote file system starts/completes
> when the ack is sent to the checkpoint coordinator
>
> For now, we only see in the Flink UI that a checkpoint timed out because a task couldn't finish in time, which seems too limited to debug further.
>
> Chen
>
> On 4/24/20 10:52 PM, Congxian Qiu wrote:
>> Hi,
>> If the bottleneck is the upload part, have you tried uploading the files using multiple threads [1]?
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-11008
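>>
>> A minimal sketch of the flink-conf.yaml entry, assuming the option added by [1] is available in your Flink version (the thread count of 4 is just an example value):
>>
>>     # Number of threads used to transfer files for RocksDB incremental
>>     # checkpoints (upload on snapshot, download on restore).
>>     state.backend.rocksdb.checkpoint.transfer.thread.num: 4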
>> Best,
>> Congxian
>>
>> Lu Niu <qqib...@gmail.com> wrote on Fri, Apr 24, 2020 at 12:38 PM:
>> Hi Robert,
>>
>> Thanks for replying. Yeah, after I added monitoring on the above path, it showed that the slowness did come from uploading files to S3. Right now I am still investigating the issue. At the same time, I am trying PrestoS3FileSystem to check whether that can mitigate the problem.
>>
>> Best
>> Lu
>>
>> On Thu, Apr 23, 2020 at 8:10 AM Robert Metzger <rmetz...@apache.org> wrote:
>> Hi Lu,
>>
>> were you able to resolve the issue with the slow async checkpoints?
>>
>> I've added Yu Li to this thread. He has more experience with the state backends and can decide which monitoring is appropriate for such situations.
>>
>> Best,
>> Robert
>>
>> On Tue, Apr 21, 2020 at 10:50 PM Lu Niu <qqib...@gmail.com> wrote:
>> Hi Robert,
>>
>> Thanks for replying. To improve observability, do you think we should expose more metrics for checkpointing? For example, in incremental checkpointing, the time spent on uploading the sst files?
>> https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L319
>>
>> Best
>> Lu
>>
>> On Fri, Apr 17, 2020 at 11:31 AM Robert Metzger <rmetz...@apache.org> wrote:
>> Hi,
>> did you check the TaskManager logs to see whether there are retries by the s3a file system during checkpointing?
>>
>> I'm not aware of any metrics in Flink that could be helpful in this situation.
>>
>> Best,
>> Robert
>>
>> On Tue, Apr 14, 2020 at 12:02 AM Lu Niu <qqib...@gmail.com> wrote:
>> Hi Flink users,
>>
>> We notice that async checkpointing can sometimes be extremely slow, leading to checkpoint timeouts. For example, for a state size of around 2.5 MB, it can take 7-12 min in the async checkpointing phase:
>>
>> <Screen Shot 2020-04-09 at 5.04.30 PM.png>
>>
>> Notice that all the slowness comes from async checkpointing; there is no delay in the sync part or in barrier alignment. As we use RocksDB incremental checkpointing, I suspect the slowness might be caused by uploading the files to S3. However, I am not completely sure, since there are other steps in async checkpointing. Does Flink expose fine-grained metrics to debug such slowness?
>>
>> Setup: Flink 1.9.1, RocksDB incremental state backend, S3A Hadoop FileSystem
>>
>> Best
>> Lu
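>>
>> P.S. For reference, a minimal sketch of how that setup is wired up (the bucket path and interval values are placeholders, not the real job's values):
>>
>> import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>> public class CheckpointSetup {
>>     public static void main(String[] args) throws Exception {
>>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>
>>         // RocksDB state backend with incremental checkpoints enabled,
>>         // writing to S3 through the S3A Hadoop file system ("s3a://" scheme).
>>         env.setStateBackend(new RocksDBStateBackend("s3a://my-bucket/checkpoints", true));
>>
>>         // Checkpoint interval and timeout; example values only, the timeout
>>         // is what the slow async phase keeps running into.
>>         env.enableCheckpointing(60_000);
>>         env.getCheckpointConfig().setCheckpointTimeout(600_000);
>>
>>         // ... job topology goes here ...
>>         env.execute("checkpoint-setup-sketch");
>>     }
>> }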