What exactly are the problems you are seeing when checkpoint recovery does
not work? Even if the ZooKeeper connection is temporarily lost, which leads
to the JobMaster losing leadership and the job being suspended, the next
leader should continue from where the previous execution stopped because of
the lost ZooKeeper connection.
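
For completeness, this failover behaviour assumes that ZooKeeper HA is
configured. A minimal flink-conf.yaml sketch (the quorum addresses, storage
directory and root path below are placeholders for your environment):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
    high-availability.storageDir: hdfs:///flink/recovery/
    high-availability.zookeeper.path.root: /flink

ZooKeeper itself only stores small pointers; the actual checkpoint metadata
is written to the storageDir.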

What happens under the hood when restoring from a savepoint is that the
savepoint is inserted into the CompletedCheckpointStore, where the other
checkpoints are also stored. If a failure then occurs, Flink will first try
to recover from a checkpoint/savepoint in the CompletedCheckpointStore, and
only if this store does not contain any checkpoints/savepoints will it fall
back to the savepoint with which the job was started. The
CompletedCheckpointStore persists the checkpoint/savepoint information by
writing the pointers to ZooKeeper.
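
In other words, the recovery preference can be summarised roughly like the
sketch below. The class and method names (RecoveryPlanner, CompletedStore,
chooseRestorePoint, ...) are made up for illustration and are not Flink's
actual API; they only capture the ordering described above:

    import java.util.Optional;

    public final class RecoveryPlanner {

        /** Pointer to checkpoint/savepoint metadata, e.g. a path in the HA storage dir. */
        interface RestorePointer {}

        /** Hypothetical, simplified view of the ZooKeeper-backed CompletedCheckpointStore. */
        interface CompletedStore {
            // Latest completed checkpoint or savepoint known to the store, if any.
            Optional<RestorePointer> latest();
        }

        /**
         * Pick the restore point after a failure:
         * 1. prefer the latest entry in the CompletedCheckpointStore
         *    (the savepoint the job was started from is inserted there too),
         * 2. only if the store is empty, fall back to the savepoint
         *    passed at job submission.
         */
        static Optional<RestorePointer> chooseRestorePoint(
                CompletedStore store, Optional<RestorePointer> submissionSavepoint) {
            Optional<RestorePointer> latest = store.latest();
            return latest.isPresent() ? latest : submissionSavepoint;
        }
    }

So the submission-time savepoint only matters as long as the ZooKeeper-backed
store is empty; once a newer checkpoint has completed, that checkpoint is
preferred.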

Cheers,
Till

On Mon, Dec 21, 2020 at 11:38 AM vishalovercome <vis...@moengage.com> wrote:

> Thanks for your reply!
>
> What I have seen is that the job terminates when there's intermittent loss
> of connectivity with ZooKeeper. This is in fact the most common reason why
> our jobs are terminating at this point. Worse, it's unable to restore from
> checkpoint during some (not all) of these terminations. Under these
> scenarios, won't the job try to recover from a savepoint?
>
> I've gone through various tickets reporting stability issues due to
> ZooKeeper that you've mentioned you intend to resolve soon. But until the
> ZooKeeper-based HA is stable, should we assume that it will repeatedly
> restore from savepoints? I would rather rely on kafka offsets to resume
> where it left off rather than savepoints.
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>
