Hi
From my experience, you can first check the jobmanager.log and find out whether the checkpoint expired or was declined by some task. If it expired, you can follow the advice seeksst gave above (enabling debug logging may help you here); if it was declined, then you can go to the taskmanager.log to see which task declined it and why.
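As a concrete starting point, here is a small shell sketch (my own suggestion, not an official Flink tool) that counts the two failure kinds in a JobManager log. The grep patterns are assumptions about typical log phrasing, which varies across Flink versions, so adjust them to match what your version actually prints:

```shell
# Classify failed checkpoints in a JobManager log as expired vs declined.
# The patterns are guesses at typical wording; tune them for your version.
classify_checkpoints() {
  log="$1"
  echo "expired: $(grep -ci 'checkpoint.*expired' "$log")"
  echo "declined: $(grep -ci 'declin.*checkpoint' "$log")"
}
```

Running `classify_checkpoints jobmanager.log` tells you which branch of the advice above applies to your job.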
seeksst has already covered many of the relevant points, but a few more
thoughts:

I would start by checking whether the checkpoints are failing because they
time out, or for some other reason. Assuming they are timing out, a good
place to start is to look and see if this can be explained by backpressure.
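If timeouts are indeed the problem, the timeout itself is also configurable. A sketch of the relevant flink-conf.yaml keys (the key names exist in recent Flink versions; the values here are illustrative placeholders, not recommendations):

```yaml
# flink-conf.yaml -- checkpointing settings relevant to timeouts
execution.checkpointing.interval: 1min
execution.checkpointing.timeout: 10min     # raise this if checkpoints expire
execution.checkpointing.min-pause: 30s     # breathing room between checkpoints
```

Raising the timeout only hides the symptom, though; if backpressure is the cause, fixing the slow operator or sink is the real cure.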
Hi:

From my experience, there are several possible reasons for checkpoint
failure:

1. If you use RocksDB as the state backend, insufficient disk space will
cause it, because the state files are saved on local disk, and you should
see an exception for this.
2. The sink can't be written to, so all of its parallel instances are
blocked and can't acknowledge the checkpoint.
We have a topology and the checkpoints fail to complete a *lot* of the time.
Typically it is just one subtask that fails.
We have a parallelism of 2 on this topology at present, and the other
subtask will complete in 3 ms, though the end-to-end duration on the rare
occasions when the checkpointing succeeds is much longer.