Hi,
    We are running Flink in K8S. We used 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/jobmanager_high_availability.html
    to set high availability. We set max number of retries for a task to 2. 
After task fails twice and then the job manager fails. This is expected. But it 
is removing checkpoint from the zookeeper. As a result on the restart it is not 
consuming from the previous checkpoint. We are losing the data.

Logs:

2020/06/06 19:39:07.759 INFO  o.a.f.r.c.CheckpointCoordinator - Stopping 
checkpoint coordinator for job 00000000000000000000000000000000.
2020/06/06 19:39:07.759 INFO  o.a.f.r.c.ZooKeeperCompletedCheckpointStore - 
Shutting down
2020/06/06 19:39:07.823 INFO  o.a.f.r.z.ZooKeeperStateHandleStore - Removing 
/flink/sessionization_test4/checkpoints/00000000000000000000000000000000 from 
ZooKeeper
2020/06/06 19:39:07.823 INFO  o.a.f.r.c.CompletedCheckpoint - Checkpoint with 
ID 11 at 
's3://s3_bucket/sessionization_test/checkpoints/00000000000000000000000000000000/chk-11'
 not discarded.
2020/06/06 19:39:07.829 INFO  o.a.f.r.c.ZooKeeperCheckpointIDCounter - Shutting 
down.
2020/06/06 19:39:07.829 INFO  o.a.f.r.c.ZooKeeperCheckpointIDCounter - Removing 
/checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
2020/06/06 19:39:07.852 INFO  o.a.f.r.dispatcher.MiniDispatcher - Job 
00000000000000000000000000000000 reached globally terminal state FAILED.
2020/06/06 19:39:07.852 INFO  o.a.f.runtime.jobmaster.JobMaster - Stopping the 
JobMaster for job 
sppstandardresourcemanager-flink-0606193838-6d7dae7e(00000000000000000000000000000000).
2020/06/06 19:39:07.854 INFO  o.a.f.r.entrypoint.ClusterEntrypoint - Shutting 
StandaloneJobClusterEntryPoint down with application status FAILED. Diagnostics 
null.
2020/06/06 19:39:07.854 INFO  o.a.f.r.j.MiniDispatcherRestEndpoint - Shutting 
down rest endpoint.
2020/06/06 19:39:07.856 INFO  o.a.f.r.l.ZooKeeperLeaderRetrievalService - 
Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2020/06/06 19:39:07.859 INFO  o.a.f.r.j.slotpool.SlotPoolImpl - Suspending 
SlotPool.
2020/06/06 19:39:07.859 INFO  o.a.f.runtime.jobmaster.JobMaster - Close 
ResourceManager connection d28e9b9e1fc1ba78c2ed010070518057: JobManager is 
shutting down..
2020/06/06 19:39:07.859 INFO  o.a.f.r.j.slotpool.SlotPoolImpl - Stopping 
SlotPool.
2020/06/06 19:39:07.859 INFO  o.a.f.r.r.StandaloneResourceManager - Disconnect 
job manager 
afae482ff82bdb26fe275174c14d4...@akka.tcp://flink@flink-job-cluster:6123/user/jobmanager_0
 for job 00000000000000000000000000000000 from the resource manager.
2020/06/06 19:39:07.860 INFO  o.a.f.r.l.ZooKeeperLeaderElectionService - 
Stopping ZooKeeperLeaderElectionService 
ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
2020/06/06 19:39:07.868 INFO  o.a.f.r.j.MiniDispatcherRestEndpoint - Removing 
cache directory /tmp/flink-web-ef940924-348b-461c-ab53-255a914ed43a/flink-web-ui
2020/06/06 19:39:07.870 INFO  o.a.f.r.l.ZooKeeperLeaderElectionService - 
Stopping ZooKeeperLeaderElectionService 
ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2020/06/06 19:39:07.870 INFO  o.a.f.r.j.MiniDispatcherRestEndpoint - Shut down 
complete.
2020/06/06 19:39:07.870 INFO  o.a.f.r.r.StandaloneResourceManager - Shut down 
cluster because application is in FAILED, diagnostics null.
2020/06/06 19:39:07.870 INFO  o.a.f.r.l.ZooKeeperLeaderRetrievalService - 
Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2020/06/06 19:39:07.871 INFO  o.a.f.r.l.ZooKeeperLeaderRetrievalService - 
Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2020/06/06 19:39:07.871 INFO  o.a.f.r.dispatcher.MiniDispatcher - Stopping 
dispatcher akka.tcp://flink@flink-job-cluster:6123/user/dispatcher.
2020/06/06 19:39:07.871 INFO  o.a.f.r.dispatcher.MiniDispatcher - Stopping all 
currently running jobs of dispatcher 
akka.tcp://flink@flink-job-cluster:6123/user/dispatcher.
2020/06/06 19:39:07.871 INFO  o.a.f.r.r.s.SlotManagerImpl - Closing the 
SlotManager.
2020/06/06 19:39:07.871 INFO  o.a.f.r.r.s.SlotManagerImpl - Suspending the 
SlotManager.
2020/06/06 19:39:07.871 INFO  o.a.f.r.l.ZooKeeperLeaderElectionService - 
Stopping ZooKeeperLeaderElectionService 
ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2020/06/06 19:39:07.871 INFO  o.a.f.r.l.ZooKeeperLeaderRetrievalService - 
Stopping ZooKeeperLeaderRetrievalService 
/leader/00000000000000000000000000000000/job_manager_lock.
2020/06/06 19:39:07.974 INFO  o.a.f.r.r.h.l.b.StackTraceSampleCoordinator - 
Shutting down stack trace sample coordinator.
2020/06/06 19:39:07.975 INFO  o.a.f.r.l.ZooKeeperLeaderElectionService - 
Stopping ZooKeeperLeaderElectionService 
ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2020/06/06 19:39:07.975 INFO  o.a.f.r.dispatcher.MiniDispatcher - Stopped 
dispatcher akka.tcp://flink@flink-job-cluster:6123/user/dispatcher.
2020/06/06 19:39:07.975 INFO  o.a.flink.runtime.blob.BlobServer - Stopped BLOB 
server at 0.0.0.0:6124
2020/06/06 19:39:07.975 INFO  o.a.f.r.h.z.ZooKeeperHaServices - Close and clean 
up all data for ZooKeeperHaServices.
2020/06/06 19:39:08.085 INFO  o.a.f.s.c.o.a.c.f.i.CuratorFrameworkImpl - 
backgroundOperationsLoop exiting
2020/06/06 19:39:08.090 INFO  o.a.f.s.z.o.a.zookeeper.ClientCnxn - EventThread 
shut down for session: 0x17282452e8c0823
2020/06/06 19:39:08.090 INFO  o.a.f.s.z.o.a.zookeeper.ZooKeeper - Session: 
0x17282452e8c0823 closed
2020/06/06 19:39:08.091 INFO  o.a.f.r.rpc.akka.AkkaRpcService - Stopping Akka 
RPC service.
2020/06/06 19:39:08.093 INFO  o.a.f.r.rpc.akka.AkkaRpcService - Stopping Akka 
RPC service.
2020/06/06 19:39:08.096 INFO  a.r.RemoteActorRefProvider$RemotingTerminator - 
Shutting down remote daemon.
2020/06/06 19:39:08.097 INFO  a.r.RemoteActorRefProvider$RemotingTerminator - 
Remote daemon shut down; proceeding with flushing remote transports.
2020/06/06 19:39:08.099 INFO  a.r.RemoteActorRefProvider$RemotingTerminator - 
Shutting down remote daemon.
2020/06/06 19:39:08.099 INFO  a.r.RemoteActorRefProvider$RemotingTerminator - 
Remote daemon shut down; proceeding with flushing remote transports.
2020/06/06 19:39:08.108 INFO  a.r.RemoteActorRefProvider$RemotingTerminator - 
Remoting shut down.
2020/06/06 19:39:08.114 INFO  a.r.RemoteActorRefProvider$RemotingTerminator - 
Remoting shut down.
2020/06/06 19:39:08.123 INFO  o.a.f.r.rpc.akka.AkkaRpcService - Stopped Akka 
RPC service.
2020/06/06 19:39:08.124 INFO  o.a.f.r.entrypoint.ClusterEntrypoint - 
Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit 
code 1443.









Can you please help?

Thanks
Sandeep Kathula

Reply via email to