What exactly are the problems when the checkpoint recovery does not work? Even if the ZooKeeper connection is temporarily lost, which causes the JobMaster to lose leadership and the job to be suspended, the next leader should continue from where the previous one left off when it was stopped by the lost ZooKeeper connection.

What happens under the hood when restoring from a savepoint is that the savepoint is inserted into the CompletedCheckpointStore, where the other checkpoints are also stored. If a failure then happens, Flink will first try to recover from a checkpoint/savepoint in the CompletedCheckpointStore, and only if this store does not contain any checkpoints/savepoints will it use the savepoint with which the job was started. The CompletedCheckpointStore persists the checkpoint/savepoint information by writing the pointers to ZooKeeper.

Cheers,
Till

On Mon, Dec 21, 2020 at 11:38 AM vishalovercome <[email protected]> wrote:

> Thanks for your reply!
>
> What I have seen is that the job terminates when there's intermittent loss
> of connectivity with zookeeper. This is in-fact the most common reason why
> our jobs are terminating at this point. Worse, it's unable to restore from
> checkpoint during some (not all) of these terminations. Under these
> scenarios, won't the job try to recover from a savepoint?
>
> I've gone through various tickets reporting stability issues due to
> zookeeper that you've mentioned you intend to resolve soon. But until the
> zookeeper based HA is stable, should we assume that it will repeatedly
> restore from savepoints? I would rather rely on kafka offsets to resume
> where it left off rather than savepoints.
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
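
P.S. To make the recovery order from my reply concrete, here is a minimal sketch in plain Java. It is not Flink's code: RecoverySketch, Snapshot, addCompleted and chooseRestorePoint are hypothetical names, the in-memory deque only stands in for the CompletedCheckpointStore (which really persists pointers in ZooKeeper), and picking the most recent entry is an assumption of the sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical model of the recovery order; NOT Flink's CompletedCheckpointStore.
final class RecoverySketch {

    /** A completed checkpoint or savepoint, reduced to an id and a pointer string. */
    static final class Snapshot {
        final long id;
        final String pointer; // illustrative stand-in for the pointer written to ZooKeeper

        Snapshot(long id, String pointer) {
            this.id = id;
            this.pointer = pointer;
        }
    }

    // In-memory stand-in for the store of completed checkpoints/savepoints.
    private final Deque<Snapshot> store = new ArrayDeque<>();

    // Savepoint the job was started with; null if the job was started without one.
    private final Snapshot initialSavepoint;

    RecoverySketch(Snapshot initialSavepoint) {
        this.initialSavepoint = initialSavepoint;
    }

    /** Called whenever a checkpoint or savepoint completes. */
    void addCompleted(Snapshot snapshot) {
        store.addLast(snapshot);
    }

    /**
     * Decides what to restore from: an entry of the store if it has one
     * (assumption: the most recent), otherwise the initial savepoint,
     * which is then inserted into the store as well.
     */
    Optional<Snapshot> chooseRestorePoint() {
        if (!store.isEmpty()) {
            return Optional.of(store.peekLast());
        }
        if (initialSavepoint != null) {
            store.addLast(initialSavepoint);
            return Optional.of(initialSavepoint);
        }
        return Optional.empty();
    }
}
```

In this model a job started from savepoint 1 restores from it only while the store is empty; once checkpoint 2 has completed, a later recovery picks checkpoint 2 and does not go back to the savepoint.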
