Hi,

Yes, for example [1]. Most of the points that you mentioned are already visible in the UI and/or via metrics; just take a look at the subtask checkpoint stats.

> when barriers are injected at the source by the checkpoint coordinator

That's the checkpoint trigger time.

> when each downstream task observes the first barrier of a checkpoint

In Flink < 1.11 this is implicitly visible by subtracting the sync, async and alignment times from the end-to-end checkpoint time. In Flink 1.11+ [2] there is/will be an explicit new metric "checkpointStartDelayNanos" (visible in the UI as "Start Delay") for that.

> when all barriers of a checkpoint arrive at a task

Yes, that's the alignment time.
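As a back-of-the-envelope illustration of that subtraction for Flink < 1.11 (a minimal sketch; the literals are placeholders for the per-subtask values you would read from the checkpoint stats in the UI or REST API):

public class StartDelayEstimate {
    public static void main(String[] args) {
        // Per-subtask checkpoint durations in milliseconds. The literals are
        // placeholders; the real values come from the subtask checkpoint stats.
        long endToEnd  = 420_000; // end-to-end checkpoint duration
        long sync      = 1_200;   // synchronous part of the snapshot
        long async     = 15_000;  // asynchronous part of the snapshot
        long alignment = 3_500;   // barrier alignment time

        // On Flink < 1.11 the start delay is only available implicitly:
        // whatever remains of the end-to-end time is (roughly) how long the
        // first barrier took to reach the subtask.
        long startDelay = endToEnd - sync - async - alignment;
        System.out.println("approx. start delay: " + startDelay + " ms");
    }
}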
> when a snapshot starts/completes

Yes:

Start - that's start delay + alignment time, as the checkpoint starts immediately after the alignment is completed.
Complete - that's the end-to-end duration.

> when the ack is sent to the checkpoint coordinator

That's the end-to-end duration.

One thing which is missing is:

> when the upload to the remote file system starts/completes

as currently that's just part of the async time. I've created a ticket to track this work [3], so let's move the discussion about it there.

Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/checkpoint_monitoring.html
[2] https://issues.apache.org/jira/browse/FLINK-15603
[3] https://issues.apache.org/jira/browse/FLINK-17468

> On 25 Apr 2020, at 18:32, Chen Q <qinnc...@gmail.com> wrote:
>
> Just to echo what Lu mentioned, is there documentation where we can find more info on:
>
> when barriers are injected at the source by the checkpoint coordinator
> when each downstream task observes the first barrier of a checkpoint
> when all barriers of a checkpoint arrive at a task
> when a snapshot starts/completes
> when the upload to the remote file system starts/completes
> when the ack is sent to the checkpoint coordinator
>
> For now, we only see in the Flink UI that a checkpoint timed out because a task couldn't finish in time, which seems too limited to debug further.
>
> Chen
>
> On 4/24/20 10:52 PM, Congxian Qiu wrote:
>> Hi,
>> If the bottleneck is the upload part, have you tried uploading the files using multiple threads [1]?
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-11008
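>>
>> A minimal sketch of the flink-conf.yaml entry, assuming the option added by [1] is available in your Flink version (the thread count of 4 is just an example value):
>>
>>     # Number of threads used to transfer files for RocksDB incremental
>>     # checkpoints (upload on snapshot, download on restore).
>>     state.backend.rocksdb.checkpoint.transfer.thread.num: 4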
>> Best,
>> Congxian
>>
>> Lu Niu <qqib...@gmail.com> wrote on Fri, Apr 24, 2020 at 12:38 PM:
>> Hi Robert,
>>
>> Thanks for replying. Yeah, after I added monitoring on the above path, it showed that the slowness did come from uploading files to S3. Right now I am still investigating the issue. At the same time, I am trying PrestoS3FileSystem to check whether that can mitigate the problem.
>>
>> Best
>> Lu
>>
>> On Thu, Apr 23, 2020 at 8:10 AM Robert Metzger <rmetz...@apache.org> wrote:
>> Hi Lu,
>>
>> were you able to resolve the issue with the slow async checkpoints?
>>
>> I've added Yu Li to this thread. He has more experience with the state backends and can decide which monitoring is appropriate for such situations.
>>
>> Best,
>> Robert
>>
>> On Tue, Apr 21, 2020 at 10:50 PM Lu Niu <qqib...@gmail.com> wrote:
>> Hi Robert,
>>
>> Thanks for replying. To improve observability, do you think we should expose more metrics for checkpointing? For example, in incremental checkpointing, the time spent on uploading the sst files?
>> https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L319
>>
>> Best
>> Lu
>>
>> On Fri, Apr 17, 2020 at 11:31 AM Robert Metzger <rmetz...@apache.org> wrote:
>> Hi,
>> did you check the TaskManager logs to see whether there are retries by the s3a file system during checkpointing?
>>
>> I'm not aware of any metrics in Flink that could be helpful in this situation.
>>
>> Best,
>> Robert
>>
>> On Tue, Apr 14, 2020 at 12:02 AM Lu Niu <qqib...@gmail.com> wrote:
>> Hi Flink users,
>>
>> We notice that async checkpointing can sometimes be extremely slow, leading to checkpoint timeouts. For example, for a state size of around 2.5 MB, it can take 7-12 min in the async checkpointing phase:
>>
>> <Screen Shot 2020-04-09 at 5.04.30 PM.png>
>>
>> Notice that all the slowness comes from async checkpointing; there is no delay in the sync part or in barrier alignment. As we use RocksDB incremental checkpointing, I suspect the slowness might be caused by uploading the files to S3. However, I am not completely sure, since there are other steps in async checkpointing. Does Flink expose fine-grained metrics to debug such slowness?
>>
>> Setup: Flink 1.9.1, RocksDB incremental state backend, S3A Hadoop FileSystem
>>
>> Best
>> Lu
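>>
>> P.S. For reference, a minimal sketch of how that setup is wired up (the bucket path and interval values are placeholders, not the real job's values):
>>
>> import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>> public class CheckpointSetup {
>>     public static void main(String[] args) throws Exception {
>>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>
>>         // RocksDB state backend with incremental checkpoints enabled,
>>         // writing to S3 through the S3A Hadoop file system ("s3a://" scheme).
>>         env.setStateBackend(new RocksDBStateBackend("s3a://my-bucket/checkpoints", true));
>>
>>         // Checkpoint interval and timeout; example values only, the timeout
>>         // is what the slow async phase keeps running into.
>>         env.enableCheckpointing(60_000);
>>         env.getCheckpointConfig().setCheckpointTimeout(600_000);
>>
>>         // ... job topology goes here ...
>>         env.execute("checkpoint-setup-sketch");
>>     }
>> }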