Re: Flink-1.10.0 source的checkpoint偶尔时间比较长

2020-06-29 文章 Tianwang Li
>
> 偶尔会出现checkpint需要超过30分钟的(出现的频率不高,1~2天1次)

找到原因了, 任务处理延迟比较大,kafka数据过期清理了,导致从last消费(watermark一下子增长了好多个小时),
然后,这个时候需要输出几个小时内的所有窗口(平时一次只输出一个窗口,这时一次要输出30个窗口消耗比较长时间)。

因为是稳定测试任务,没有关注kafka 延迟 导致数据过期到问题。

感谢,zhisheng、LakeShen、Yichao Yang。


Yichao Yang <1048262...@qq.com> 于2020年6月29日周一 下午7:58写道:

> Hi
>
>
>
> 如果任务会经常出现反压的话,可以先解决反压问题,因为任务反压也会影响checkpoint,其次就是可以关注一下你的作业的物理资源指标,比如cpu使用率,内存使用率,gc频率是否特别高,尝试保障物理资源使用率在正常水平。
>
>
> Best,
> Yichao Yang
>
>
>
>
> --原始邮件--
> 发件人:"Tianwang Li" 发送时间:2020年6月28日(星期天) 上午10:17
> 收件人:"user-zh"
> 主题:Flink-1.10.0 source的checkpoint偶尔时间比较长
>
>
>
> 关于Flink checkpoint偶尔会比较长时间的问题。
> 环境与背景:
> 版本:flink1.10.0
> 数据量:每秒约10万左右的记录,数据源是kafka
> 计算逻辑:滑动窗口统计,每个窗口输出的规模大概1~2千万记录。
> 是否有反压:任务经常会出现反压(特别是在窗口输出的时候)。
>
>
> 问题:
>  
> 大部分时候checkpoint都是在1分钟内完成,偶尔会出现checkpint需要超过30分钟的(出现的频率不高,1~2天1次)。
> source的checkpoint消耗的时间比较长。Triggercheckpoint 到Starting
> checkpoint消耗时间比较长。
>
>
> checkpoint情况大致如下:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> 2020-06-24 21:09:53,369 DEBUG
> org.apache.flink.runtime.taskexecutor.TaskExecutor  
>- Trigger checkpoint 316@1593004193363 for
> 84dce1ec8aa5a4df2d1758d6e9278693.
>
> 2020-06-24 21:09:58,327 DEBUG
> org.apache.flink.runtime.taskexecutor.TaskExecutor  
>- Received heartbeat request from
> e88ea2f790430c9c160e540ef0546d60.
>
> 2020-06-24 21:09:59,266 DEBUG
> org.apache.flink.runtime.taskexecutor.TaskExecutor  
>- Received heartbeat request from
> b93d7167db364dfdcbda886944f1482f.
>
> 2020-06-24 21:09:59,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  
>  - Memory usage stats: [HEAP: 202/2573/2573 MB, NON HEAP: 111/114/424
> MB (used/committed/max)]
>
> 2020-06-24 21:09:59,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  
>  - Direct memory stats: Count: 17403, Total Capacity: 583911423, Used
> Memory: 583911424
>
> 2020-06-24 21:09:59,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  
>  - Off-heap pool stats: [Code Cache: 35/35/240 MB
> (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> [Compressed Class Space: 8/9/88 MB (used/committed/max)]
>
> 2020-06-24 21:09:59,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  
>  - Garbage collector stats: [PS Scavenge, GC TIME (ms): 108643, GC
> COUNT: 6981], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
>
> 2020-06-24 21:10:08,346 DEBUG
> org.apache.flink.runtime.taskexecutor.TaskExecutor  
>- Received heartbeat request from
> e88ea2f790430c9c160e540ef0546d60.
>
> 2020-06-24 21:10:09,286 DEBUG
> org.apache.flink.runtime.taskexecutor.TaskExecutor  
>- Received heartbeat request from
> b93d7167db364dfdcbda886944f1482f.
>
> 2020-06-24 21:10:09,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  
>  - Memory usage stats: [HEAP: 557/2573/2573 MB, NON HEAP: 111/114/424
> MB (used/committed/max)]
>
> 2020-06-24 21:10:09,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  
>  - Direct memory stats: Count: 17403, Total Capacity: 583911423, Used
> Memory: 583911424
>
> 2020-06-24 21:10:09,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  
>  - Off-heap pool stats: [Code Cache: 35/35/240 MB
> (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> [Compressed Class Space: 8/9/88 MB (used/committed/max)]
>
> 2020-06-24 21:10:09,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  
>  - Garbage collector stats: [PS Scavenge, GC TIME (ms): 108643, GC
> COUNT: 6981], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
>
>
>
>
> 省略
>
>
>
>
> 2020-06-24 21:55:39,875 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  
>  - Direct memory stats: Count: 17403, Total Capacity: 583911423, Used
> Memory: 583911424
>
> 2020-06-24 21:55:39,875 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  
>  - Off-heap pool stats: [Code Cache: 35/35/240 MB
> (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> [Compressed Class Space: 8/9/88 MB (used/committed/max)]
>
> 2020-06-24 21:55:39,876 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  
>  - Garbage collector stats: [PS Scavenge, GC TIME (ms): 110520, GC
> COUNT: 7083], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
>
> 2020-06-24 21:55:41,721 DEBUG
> org.apache.flink.streaming.runtime.tasks.StreamTask   
>   - Starting checkpoint (316) CHECKPOINT on task Source: Custom
> Source - Map - Filter - Timestamps/Watermarks (4/10)
>
> 2020-06-24 21:55:41,721 DEBUG
> o

Re: Flink-1.10.0 source的checkpoint偶尔时间比较长

2020-06-28 文章 Tianwang Li
>
> 偶尔一两天出现 Checkpoint 超时,看下你的任务中,是否可能存在某类 key 在这一两天会突然增多的情况。
>
我增加每个task处理窗口数据的时间在观察一下,
我这个是测试任务,没有sink输出。
source -> window -> window(统计上一个窗口的输出的记录数,pint 10记录左右)

LakeShen  于2020年6月28日周日 上午10:35写道:

> Hi Tianwang Li,
>
> 偶尔一两天出现 Checkpoint 超时,看下你的任务中,是否可能存在某类 key 在这一两天会突然增多的情况。
>
> Best,
> LakeShen
>
> zhisheng  于2020年6月28日周日 上午10:27写道:
>
> > hi, Tianwang Li
> >
> > 看到有三个图片挂了,可以试着把图片上传到第三方的图床,然后贴个链接过来,另外:
> >
> > > 任务经常会出现反压(特别是在窗口输出的时候)
> >
> > 这个检查一下窗口下游算子的情况,比如是不是窗口输出的数据过多,而 sink 的并发还和之前的保持一致,导致处理速度跟不上,从而导致的反压。
> >
> >
> > > 大部分时候checkpoint都是在1分钟内完成,偶尔会出现checkpint需要超过30分钟的(出现的频率不高,1~2天1次)
> >
> > 这种也可以看看是不是 HDFS 有时候压力比较大导致的出现毛刺现象
> >
> > 另外建议补一下 UI 上 chekcpoint 相关的截图和日志信息,这样才能更好的定位问题。
> >
> >
> > Best !
> > zhisheng
> >
> >
> > Tianwang Li  于2020年6月28日周日 上午10:17写道:
> >
> > > 关于Flink checkpoint偶尔会比较长时间的问题。
> > >
> > > *环境与背景:*
> > > 版本:flink1.10.0
> > > 数据量:每秒约10万左右的记录,数据源是kafka
> > > 计算逻辑:滑动窗口统计,每个窗口输出的规模大概1~2千万记录。
> > > 是否有反压:任务经常会出现反压(特别是在窗口输出的时候)。
> > >
> > > *问题:*
> > > 大部分时候checkpoint都是在1分钟内完成,偶尔会出现checkpint需要超过30分钟的(出现的频率不高,1~2天1次)。
> > > source的checkpoint消耗的时间比较长。Trigger checkpoint 到 Starting
> > checkpoint消耗时间比较长。
> > >
> > > checkpoint情况大致如下:
> > >
> > > [image: image.png]
> > > [image: image.png]
> > > [image: image.png]
> > >
> > > 2020-06-24 21:09:53,369 DEBUG
> > > org.apache.flink.runtime.taskexecutor.TaskExecutor- Trigger
> > > checkpoint 316@1593004193363 for 84dce1ec8aa5a4df2d1758d6e9278693.
> > >
> > > 2020-06-24 21:09:58,327 DEBUG
> > > org.apache.flink.runtime.taskexecutor.TaskExecutor-
> Received
> > > heartbeat request from e88ea2f790430c9c160e540ef0546d60.
> > >
> > > 2020-06-24 21:09:59,266 DEBUG
> > > org.apache.flink.runtime.taskexecutor.TaskExecutor-
> Received
> > > heartbeat request from b93d7167db364dfdcbda886944f1482f.
> > >
> > > 2020-06-24 21:09:59,686 INFO
> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> > >   - Memory usage stats: [HEAP: 202/2573/2573 MB, NON HEAP:
> > > 111/114/424 MB (used/committed/max)]
> > >
> > > 2020-06-24 21:09:59,686 INFO
> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> > >   - Direct memory stats: Count: 17403, Total Capacity: 583911423,
> > > Used Memory: 583911424
> > >
> > > 2020-06-24 21:09:59,686 INFO
> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> > >   - Off-heap pool stats: [Code Cache: 35/35/240 MB
> > > (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> > > [Compressed Class Space: 8/9/88 MB (used/committed/max)]
> > >
> > > 2020-06-24 21:09:59,686 INFO
> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> > >   - Garbage collector stats: [PS Scavenge, GC TIME (ms): 108643, GC
> > > COUNT: 6981], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
> > >
> > > 2020-06-24 21:10:08,346 DEBUG
> > > org.apache.flink.runtime.taskexecutor.TaskExecutor-
> Received
> > > heartbeat request from e88ea2f790430c9c160e540ef0546d60.
> > >
> > > 2020-06-24 21:10:09,286 DEBUG
> > > org.apache.flink.runtime.taskexecutor.TaskExecutor-
> Received
> > > heartbeat request from b93d7167db364dfdcbda886944f1482f.
> > >
> > > 2020-06-24 21:10:09,686 INFO
> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> > >   - Memory usage stats: [HEAP: 557/2573/2573 MB, NON HEAP:
> > > 111/114/424 MB (used/committed/max)]
> > >
> > > 2020-06-24 21:10:09,686 INFO
> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> > >   - Direct memory stats: Count: 17403, Total Capacity: 583911423,
> > > Used Memory: 583911424
> > >
> > > 2020-06-24 21:10:09,686 INFO
> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> > >   - Off-heap pool stats: [Code Cache: 35/35/240 MB
> > > (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> > > [Compressed Class Space: 8/9/88 MB (used/committed/max)]
> > >
> > > 2020-06-24 21:10:09,686 INFO
> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> > >   - Garbage collector stats: [PS Scavenge, GC TIME (ms): 108643, GC
> > > COUNT: 6981], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
> > >
> > >
> > > 省略
> > >
> > >
> > > 2020-06-24 21:55:39,875 INFO
> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> > >   - Direct memory stats: Count: 17403, Total Capacity: 583911423,
> > > Used Memory: 583911424
> > >
> > > 2020-06-24 21:55:39,875 INFO
> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> > >   - Off-heap pool stats: [Code Cache: 35/35/240 MB
> > > (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> > > [Compressed Class Space: 8/9/88 MB (used/committed/max)]
> > >
> > > 2020-06-24 21:55:39,876 INFO
> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> > >   - Garbage collector stats: [PS Scavenge, GC TIME (ms): 110520, GC
> > > COUNT: 7083], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
> > >
> > > 2020-06-24 21:55:41,721 DEBUG
> > > 

Re: Flink-1.10.0 source的checkpoint偶尔时间比较长

2020-06-27 文章 Tianwang Li
我补充一下,checkpoint的UI截图如下:


https://imgchr.com/i/NgCUgS

https://imgchr.com/i/NgChDJ

https://imgchr.com/i/NgCT4x



>

-- 
**
 tivanli
**


Re: Flink-1.10.0 source的checkpoint偶尔时间比较长

2020-06-27 文章 LakeShen
Hi Tianwang Li,

偶尔一两天出现 Checkpoint 超时,看下你的任务中,是否可能存在某类 key 在这一两天会突然增多的情况。

Best,
LakeShen

zhisheng  于2020年6月28日周日 上午10:27写道:

> hi, Tianwang Li
>
> 看到有三个图片挂了,可以试着把图片上传到第三方的图床,然后贴个链接过来,另外:
>
> > 任务经常会出现反压(特别是在窗口输出的时候)
>
> 这个检查一下窗口下游算子的情况,比如是不是窗口输出的数据过多,而 sink 的并发还和之前的保持一致,导致处理速度跟不上,从而导致的反压。
>
>
> > 大部分时候checkpoint都是在1分钟内完成,偶尔会出现checkpint需要超过30分钟的(出现的频率不高,1~2天1次)
>
> 这种也可以看看是不是 HDFS 有时候压力比较大导致的出现毛刺现象
>
> 另外建议补一下 UI 上 chekcpoint 相关的截图和日志信息,这样才能更好的定位问题。
>
>
> Best !
> zhisheng
>
>
> Tianwang Li  于2020年6月28日周日 上午10:17写道:
>
> > 关于Flink checkpoint偶尔会比较长时间的问题。
> >
> > *环境与背景:*
> > 版本:flink1.10.0
> > 数据量:每秒约10万左右的记录,数据源是kafka
> > 计算逻辑:滑动窗口统计,每个窗口输出的规模大概1~2千万记录。
> > 是否有反压:任务经常会出现反压(特别是在窗口输出的时候)。
> >
> > *问题:*
> > 大部分时候checkpoint都是在1分钟内完成,偶尔会出现checkpint需要超过30分钟的(出现的频率不高,1~2天1次)。
> > source的checkpoint消耗的时间比较长。Trigger checkpoint 到 Starting
> checkpoint消耗时间比较长。
> >
> > checkpoint情况大致如下:
> >
> > [image: image.png]
> > [image: image.png]
> > [image: image.png]
> >
> > 2020-06-24 21:09:53,369 DEBUG
> > org.apache.flink.runtime.taskexecutor.TaskExecutor- Trigger
> > checkpoint 316@1593004193363 for 84dce1ec8aa5a4df2d1758d6e9278693.
> >
> > 2020-06-24 21:09:58,327 DEBUG
> > org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
> > heartbeat request from e88ea2f790430c9c160e540ef0546d60.
> >
> > 2020-06-24 21:09:59,266 DEBUG
> > org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
> > heartbeat request from b93d7167db364dfdcbda886944f1482f.
> >
> > 2020-06-24 21:09:59,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> >   - Memory usage stats: [HEAP: 202/2573/2573 MB, NON HEAP:
> > 111/114/424 MB (used/committed/max)]
> >
> > 2020-06-24 21:09:59,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> >   - Direct memory stats: Count: 17403, Total Capacity: 583911423,
> > Used Memory: 583911424
> >
> > 2020-06-24 21:09:59,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> >   - Off-heap pool stats: [Code Cache: 35/35/240 MB
> > (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> > [Compressed Class Space: 8/9/88 MB (used/committed/max)]
> >
> > 2020-06-24 21:09:59,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> >   - Garbage collector stats: [PS Scavenge, GC TIME (ms): 108643, GC
> > COUNT: 6981], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
> >
> > 2020-06-24 21:10:08,346 DEBUG
> > org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
> > heartbeat request from e88ea2f790430c9c160e540ef0546d60.
> >
> > 2020-06-24 21:10:09,286 DEBUG
> > org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
> > heartbeat request from b93d7167db364dfdcbda886944f1482f.
> >
> > 2020-06-24 21:10:09,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> >   - Memory usage stats: [HEAP: 557/2573/2573 MB, NON HEAP:
> > 111/114/424 MB (used/committed/max)]
> >
> > 2020-06-24 21:10:09,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> >   - Direct memory stats: Count: 17403, Total Capacity: 583911423,
> > Used Memory: 583911424
> >
> > 2020-06-24 21:10:09,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> >   - Off-heap pool stats: [Code Cache: 35/35/240 MB
> > (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> > [Compressed Class Space: 8/9/88 MB (used/committed/max)]
> >
> > 2020-06-24 21:10:09,686 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> >   - Garbage collector stats: [PS Scavenge, GC TIME (ms): 108643, GC
> > COUNT: 6981], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
> >
> >
> > 省略
> >
> >
> > 2020-06-24 21:55:39,875 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> >   - Direct memory stats: Count: 17403, Total Capacity: 583911423,
> > Used Memory: 583911424
> >
> > 2020-06-24 21:55:39,875 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> >   - Off-heap pool stats: [Code Cache: 35/35/240 MB
> > (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> > [Compressed Class Space: 8/9/88 MB (used/committed/max)]
> >
> > 2020-06-24 21:55:39,876 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
> >   - Garbage collector stats: [PS Scavenge, GC TIME (ms): 110520, GC
> > COUNT: 7083], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
> >
> > 2020-06-24 21:55:41,721 DEBUG
> > org.apache.flink.streaming.runtime.tasks.StreamTask   - Starting
> > checkpoint (316) CHECKPOINT on task Source: Custom Source -> Map ->
> Filter
> > -> Timestamps/Watermarks (4/10)
> >
> > 2020-06-24 21:55:41,721 DEBUG
> > org.apache.flink.runtime.state.AbstractSnapshotStrategy   -
> > DefaultOperatorStateBackend snapshot (FsCheckpointStorageLocation
> > {fileSystem=org.apache.flink.core.fs.SafetyNetWrapperFileSystem@4a4b11cd
> ,
> > checkpointDirectory=hdfs://chk-316,
> > 

Re: Flink-1.10.0 source的checkpoint偶尔时间比较长

2020-06-27 文章 zhisheng
hi, Tianwang Li

看到有三个图片挂了,可以试着把图片上传到第三方的图床,然后贴个链接过来,另外:

> 任务经常会出现反压(特别是在窗口输出的时候)

这个检查一下窗口下游算子的情况,比如是不是窗口输出的数据过多,而 sink 的并发还和之前的保持一致,导致处理速度跟不上,从而导致的反压。


> 大部分时候checkpoint都是在1分钟内完成,偶尔会出现checkpint需要超过30分钟的(出现的频率不高,1~2天1次)

这种也可以看看是不是 HDFS 有时候压力比较大导致的出现毛刺现象

另外建议补一下 UI 上 chekcpoint 相关的截图和日志信息,这样才能更好的定位问题。


Best !
zhisheng


Tianwang Li  于2020年6月28日周日 上午10:17写道:

> 关于Flink checkpoint偶尔会比较长时间的问题。
>
> *环境与背景:*
> 版本:flink1.10.0
> 数据量:每秒约10万左右的记录,数据源是kafka
> 计算逻辑:滑动窗口统计,每个窗口输出的规模大概1~2千万记录。
> 是否有反压:任务经常会出现反压(特别是在窗口输出的时候)。
>
> *问题:*
> 大部分时候checkpoint都是在1分钟内完成,偶尔会出现checkpint需要超过30分钟的(出现的频率不高,1~2天1次)。
> source的checkpoint消耗的时间比较长。Trigger checkpoint 到 Starting checkpoint消耗时间比较长。
>
> checkpoint情况大致如下:
>
> [image: image.png]
> [image: image.png]
> [image: image.png]
>
> 2020-06-24 21:09:53,369 DEBUG
> org.apache.flink.runtime.taskexecutor.TaskExecutor- Trigger
> checkpoint 316@1593004193363 for 84dce1ec8aa5a4df2d1758d6e9278693.
>
> 2020-06-24 21:09:58,327 DEBUG
> org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
> heartbeat request from e88ea2f790430c9c160e540ef0546d60.
>
> 2020-06-24 21:09:59,266 DEBUG
> org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
> heartbeat request from b93d7167db364dfdcbda886944f1482f.
>
> 2020-06-24 21:09:59,686 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
>   - Memory usage stats: [HEAP: 202/2573/2573 MB, NON HEAP:
> 111/114/424 MB (used/committed/max)]
>
> 2020-06-24 21:09:59,686 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
>   - Direct memory stats: Count: 17403, Total Capacity: 583911423,
> Used Memory: 583911424
>
> 2020-06-24 21:09:59,686 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
>   - Off-heap pool stats: [Code Cache: 35/35/240 MB
> (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> [Compressed Class Space: 8/9/88 MB (used/committed/max)]
>
> 2020-06-24 21:09:59,686 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
>   - Garbage collector stats: [PS Scavenge, GC TIME (ms): 108643, GC
> COUNT: 6981], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
>
> 2020-06-24 21:10:08,346 DEBUG
> org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
> heartbeat request from e88ea2f790430c9c160e540ef0546d60.
>
> 2020-06-24 21:10:09,286 DEBUG
> org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
> heartbeat request from b93d7167db364dfdcbda886944f1482f.
>
> 2020-06-24 21:10:09,686 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
>   - Memory usage stats: [HEAP: 557/2573/2573 MB, NON HEAP:
> 111/114/424 MB (used/committed/max)]
>
> 2020-06-24 21:10:09,686 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
>   - Direct memory stats: Count: 17403, Total Capacity: 583911423,
> Used Memory: 583911424
>
> 2020-06-24 21:10:09,686 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
>   - Off-heap pool stats: [Code Cache: 35/35/240 MB
> (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> [Compressed Class Space: 8/9/88 MB (used/committed/max)]
>
> 2020-06-24 21:10:09,686 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
>   - Garbage collector stats: [PS Scavenge, GC TIME (ms): 108643, GC
> COUNT: 6981], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
>
>
> 省略
>
>
> 2020-06-24 21:55:39,875 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
>   - Direct memory stats: Count: 17403, Total Capacity: 583911423,
> Used Memory: 583911424
>
> 2020-06-24 21:55:39,875 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
>   - Off-heap pool stats: [Code Cache: 35/35/240 MB
> (used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
> [Compressed Class Space: 8/9/88 MB (used/committed/max)]
>
> 2020-06-24 21:55:39,876 INFO  
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner
>   - Garbage collector stats: [PS Scavenge, GC TIME (ms): 110520, GC
> COUNT: 7083], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]
>
> 2020-06-24 21:55:41,721 DEBUG
> org.apache.flink.streaming.runtime.tasks.StreamTask   - Starting
> checkpoint (316) CHECKPOINT on task Source: Custom Source -> Map -> Filter
> -> Timestamps/Watermarks (4/10)
>
> 2020-06-24 21:55:41,721 DEBUG
> org.apache.flink.runtime.state.AbstractSnapshotStrategy   -
> DefaultOperatorStateBackend snapshot (FsCheckpointStorageLocation
> {fileSystem=org.apache.flink.core.fs.SafetyNetWrapperFileSystem@4a4b11cd,
> checkpointDirectory=hdfs://chk-316,
> sharedStateDirectory=hdfs://shared,
> taskOwnedStateDirectory=hdfs://taskowned,
> metadataFilePath=hdfs://chk-316/_metadata, reference=(default),
> fileStateSizeThreshold=1024, writeBufferSize=4096}, synchronous part) in
> thread Thread[Source: Custom Source -> Map -> Filter ->
> Timestamps/Watermarks (4/10),5,Flink Task Threads] took 0 ms.
>

Flink-1.10.0 source的checkpoint偶尔时间比较长

2020-06-27 文章 Tianwang Li
关于Flink checkpoint偶尔会比较长时间的问题。

*环境与背景:*
版本:flink1.10.0
数据量:每秒约10万左右的记录,数据源是kafka
计算逻辑:滑动窗口统计,每个窗口输出的规模大概1~2千万记录。
是否有反压:任务经常会出现反压(特别是在窗口输出的时候)。

*问题:*
大部分时候checkpoint都是在1分钟内完成,偶尔会出现checkpint需要超过30分钟的(出现的频率不高,1~2天1次)。
source的checkpoint消耗的时间比较长。Trigger checkpoint 到 Starting checkpoint消耗时间比较长。

checkpoint情况大致如下:

[image: image.png]
[image: image.png]
[image: image.png]

2020-06-24 21:09:53,369 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor- Trigger
checkpoint 316@1593004193363 for 84dce1ec8aa5a4df2d1758d6e9278693.

2020-06-24 21:09:58,327 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
heartbeat request from e88ea2f790430c9c160e540ef0546d60.

2020-06-24 21:09:59,266 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
heartbeat request from b93d7167db364dfdcbda886944f1482f.

2020-06-24 21:09:59,686 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner
  - Memory usage stats: [HEAP: 202/2573/2573 MB, NON HEAP: 111/114/424
MB (used/committed/max)]

2020-06-24 21:09:59,686 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner
  - Direct memory stats: Count: 17403, Total Capacity: 583911423, Used
Memory: 583911424

2020-06-24 21:09:59,686 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner
  - Off-heap pool stats: [Code Cache: 35/35/240 MB
(used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
[Compressed Class Space: 8/9/88 MB (used/committed/max)]

2020-06-24 21:09:59,686 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner
  - Garbage collector stats: [PS Scavenge, GC TIME (ms): 108643, GC
COUNT: 6981], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]

2020-06-24 21:10:08,346 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
heartbeat request from e88ea2f790430c9c160e540ef0546d60.

2020-06-24 21:10:09,286 DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor- Received
heartbeat request from b93d7167db364dfdcbda886944f1482f.

2020-06-24 21:10:09,686 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner
  - Memory usage stats: [HEAP: 557/2573/2573 MB, NON HEAP: 111/114/424
MB (used/committed/max)]

2020-06-24 21:10:09,686 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner
  - Direct memory stats: Count: 17403, Total Capacity: 583911423, Used
Memory: 583911424

2020-06-24 21:10:09,686 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner
  - Off-heap pool stats: [Code Cache: 35/35/240 MB
(used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
[Compressed Class Space: 8/9/88 MB (used/committed/max)]

2020-06-24 21:10:09,686 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner
  - Garbage collector stats: [PS Scavenge, GC TIME (ms): 108643, GC
COUNT: 6981], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]


省略


2020-06-24 21:55:39,875 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner
  - Direct memory stats: Count: 17403, Total Capacity: 583911423, Used
Memory: 583911424

2020-06-24 21:55:39,875 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner
  - Off-heap pool stats: [Code Cache: 35/35/240 MB
(used/committed/max)], [Metaspace: 67/69/96 MB (used/committed/max)],
[Compressed Class Space: 8/9/88 MB (used/committed/max)]

2020-06-24 21:55:39,876 INFO
org.apache.flink.runtime.taskexecutor.TaskManagerRunner
  - Garbage collector stats: [PS Scavenge, GC TIME (ms): 110520, GC
COUNT: 7083], [PS MarkSweep, GC TIME (ms): 1074, GC COUNT: 6]

2020-06-24 21:55:41,721 DEBUG
org.apache.flink.streaming.runtime.tasks.StreamTask   - Starting
checkpoint (316) CHECKPOINT on task Source: Custom Source -> Map -> Filter
-> Timestamps/Watermarks (4/10)

2020-06-24 21:55:41,721 DEBUG
org.apache.flink.runtime.state.AbstractSnapshotStrategy   -
DefaultOperatorStateBackend snapshot (FsCheckpointStorageLocation
{fileSystem=org.apache.flink.core.fs.SafetyNetWrapperFileSystem@4a4b11cd,
checkpointDirectory=hdfs://chk-316,
sharedStateDirectory=hdfs://shared,
taskOwnedStateDirectory=hdfs://taskowned,
metadataFilePath=hdfs://chk-316/_metadata, reference=(default),
fileStateSizeThreshold=1024, writeBufferSize=4096}, synchronous part) in
thread Thread[Source: Custom Source -> Map -> Filter ->
Timestamps/Watermarks (4/10),5,Flink Task Threads] took 0 ms.

2020-06-24 21:55:41,721 DEBUG
org.apache.flink.runtime.state.AbstractSnapshotStrategy   -
DefaultOperatorStateBackend snapshot (FsCheckpointStorageLocation
{fileSystem=org.apache.flink.core.fs.SafetyNetWrapperFileSystem@4a4b11cd,
checkpointDirectory=hdfs://chk-316,
sharedStateDirectory=hdfs://shared,
taskOwnedStateDirectory=hdfs://taskowned,
metadataFilePath=hdfs://chk-316/_metadata, reference=(default),
fileStateSizeThreshold=1024, writeBufferSize=4096}, synchronous part) in
thread Thread[Source: Custom Source -> Map -> Filter ->
Timestamps/Watermarks (4/10),5,Flink