Hi All,

Occasionally I run into failed checkpoints error where 2 or 3 consecutive
checkpoints fails after running for a minute and then it recovers. This is
causing delay in processing the incoming data since there is huge amount of
data buffered during the failed checkpoints. I don't see any errors in the
taskmanager logs but here is the error in the jobmanager log. The state
size is around 100 mb.

*Checkpoint configuration:*
Option Value
Checkpointing Mode Exactly Once
Interval 1m 0s
Timeout 1m 0s
Minimum Pause Between Checkpoints 5s
Maximum Concurrent Checkpoints 1
Persist Checkpoints Externally Enabled (retain on cancellation)
*Jobmanager Log:*

2019-07-05 17:53:54,125 [flink-akka.actor.default-dispatcher-465901] WARN
o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
checkpoint attempt 9867 from 79515b6550d2c223701be0a9c870995f of job
00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,141 [flink-akka.actor.default-dispatcher-465901] WARN
o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
checkpoint attempt 9867 from 630984cdd5e66b4d9ea95a91cb4d23f6 of job
00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,168 [flink-akka.actor.default-dispatcher-465901] WARN
o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
checkpoint attempt 9867 from e12ed2e185a37559f93181905a52ebeb of job
00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,215 [flink-akka.actor.default-dispatcher-465901] WARN
o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
checkpoint attempt 9867 from 1fede192e2ff11e0905d98ff5ff6f9ce of job
00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,223 [flink-akka.actor.default-dispatcher-465901] WARN
o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
checkpoint attempt 9867 from d4e895eb20cc259c95b249cd0252930f of job
00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,310 [flink-akka.actor.default-dispatcher-465901] WARN
o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
checkpoint attempt 9867 from be5c711d7b37ed6d8022224dc447db91 of job
00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,351 [flink-akka.actor.default-dispatcher-465901] WARN
o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
checkpoint attempt 9867 from 1ed52695cc407f2f143d2bb5d23cbdbb of job
00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,398 [flink-akka.actor.default-dispatcher-465901] WARN
o.a.f.r.c.CheckpointCoordinator - Received late message for now expired
checkpoint attempt 9867 from 2e43cf968ad399c0b8426239a7dd081c of job
00ff93caa4cc9464bd41e1d050fcf65c.
2019-07-05 17:53:54,959 [flink-akka.actor.default-dispatcher-465868] INFO
o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9868 (279307042
bytes in 50707 ms).
2019-07-05 17:54:04,174 [Checkpoint Timer] INFO
o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9869 @ 1562349244171
2019-07-05 17:54:10,709 [flink-akka.actor.default-dispatcher-465905] INFO
o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9869 (253638470
bytes in 6430 ms).
2019-07-05 17:55:04,174 [Checkpoint Timer] INFO
o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9870 @ 1562349304171
2019-07-05 17:55:09,816 [flink-akka.actor.default-dispatcher-465913] INFO
o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 9870 (138649543
bytes in 5551 ms).
2019-07-05 17:56:04,174 [Checkpoint Timer] INFO
o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 9871 @ 1562349364171

Thanks

Reply via email to