> How do I identify the leader job manager? (see the third point below on approach to date)

Either through the ConfigMaps or the logs. Your approach with the ConfigMaps _should_ be correct: each ConfigMap should contain an "address" entry pointing to the current leader.
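As a rough illustration of the ConfigMap route, here is a minimal Python sketch that pulls the "address" entry out of a leader ConfigMap. The helper names (leader_address, fetch_configmap), the "default" namespace and the exact layout of the sample data (including the "sessionId" key) are assumptions for illustration; only the "address" entry is mentioned above, and the sample values are copied from the log lines quoted further down.

```python
import json
import subprocess


def leader_address(configmap):
    """Return the 'address' entry of a ConfigMap's data section, or None."""
    data = configmap.get("data") or {}
    return data.get("address")


def fetch_configmap(name, namespace="default"):
    """Fetch a ConfigMap as a dict via kubectl (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "configmap", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Hypothetical sample; the real ConfigMap layout may differ by Flink version.
sample = {
    "metadata": {"name": "thoros-restserver-leader"},
    "data": {
        "address": "http://100.117.0.6:8081",
        "sessionId": "64f52670-fe1b-452a-babe-c5fd94b38133",
    },
}

print(leader_address(sample))
```

Against a live cluster this would be called as leader_address(fetch_configmap("thoros-restserver-leader")). A result of None would mean the ConfigMap currently has no "address" entry.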

> How is the data stored at high-availability.storageDir used when the Leader is killed?

The storage dir contains the job graphs that were submitted by the client, so that on failover the new JM knows which jobs it should start up again.

> How would I best test if the HA setup is working?

Killing the leader is the best way to test it. It is of course a rather destructive means of doing so, but it also tests the very case you are interested in. A weaker test would be submitting a job to the non-leading JM, which should forward it to the leader.
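To check from the logs which leaderships a given JobManager held before and after a kill, one option is to compare its lock identity with the identity reported as elected. The sketch below assumes the two message formats seen in the logs quoted further down ("Create KubernetesLeaderElector ... with lock identity ..." and "New leader elected ... for ..."); the function name and the shortened sample text are made up for illustration.

```python
import re

# Patterns modelled on the log messages quoted in this thread.
CREATE_RE = re.compile(
    r"Create KubernetesLeaderElector ([\w-]+) with lock identity ([0-9a-f-]+)\."
)
ELECTED_RE = re.compile(r"New leader elected ([0-9a-f-]+) for ([\w-]+)\.")


def leadership_summary(log_text):
    """Map each ConfigMap name to True if this JobManager is its leader."""
    own = {}     # ConfigMap name -> lock identity of this JobManager
    leader = {}  # ConfigMap name -> most recently elected identity
    for match in CREATE_RE.finditer(log_text):
        own[match.group(1)] = match.group(2)
    for match in ELECTED_RE.finditer(log_text):
        leader[match.group(2)] = match.group(1)
    return {name: leader[name] == own.get(name) for name in leader}


# Message parts copied from the "Job Manager 1" log (timestamps dropped).
SAMPLE_LOG = """
Create KubernetesLeaderElector thoros-restserver-leader with lock identity 04658651-a7a9-47cb-aa51-2024343607fe.
New leader elected 04658651-a7a9-47cb-aa51-2024343607fe for thoros-restserver-leader.
Create KubernetesLeaderElector thoros-dispatcher-leader with lock identity 04658651-a7a9-47cb-aa51-2024343607fe.
Create KubernetesLeaderElector thoros-resourcemanager-leader with lock identity 04658651-a7a9-47cb-aa51-2024343607fe.
New leader elected 95fa9050-a8c9-44bb-bf5a-7322af43ea9d for thoros-dispatcher-leader.
New leader elected 04658651-a7a9-47cb-aa51-2024343607fe for thoros-resourcemanager-leader.
"""

for name, is_leader in leadership_summary(SAMPLE_LOG).items():
    print(name, "leader" if is_leader else "not leader")
```

Running the same function over the logs of each JobManager before and after the kill should show the leaderships moving to the surviving instance. Note that in the excerpt above the identity elected for thoros-dispatcher-leader differs from this JobManager's own lock identity.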

On 01/09/2021 19:53, jonas eyob wrote:
Hey all,

I have a setup with 2 JobManagers and 1 TaskManager (2 slots), and simply wanted to see whether leader election works correctly.

We are using:
* Standalone Application Cluster setup on Kubernetes, and have followed the example configurations provided in the documentation for HA.
* Using the org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
* Using an S3 bucket for high-availability.storageDir (with the presto plugin).
* Beam (with flink executor) and are consuming events from a Kinesis stream.

When starting up, I see a lot of activity in both JobManagers invoking the election service, and eventually, the task manager is able to connect to the leader. Running also verifies that the taskmanager is consuming from the stream.

Now there are a few questions:
- How do I identify the leader job manager? (see the third point below on approach to date)
- How is the data stored at high-availability.storageDir used when the Leader is killed? From what I see in the S3 bucket we have a file called "submittedJobGraph716ad4dcd04e" - I understood the new leader would restore from this?
- Looking at the configmaps created as part of the HA: dispatch-leader, restserver-leader, ..., I understand it as the JobManager leader-address should be present in these? But doesn't look like it?
- How would I best test if the HA setup is working? I have so far tried killing one of the JobManagers that I thought was the leader using (kubectl exec PODNAME -- /bin/sh -c "kill 1") -- is there a better way?

LOGS from Job Manager 1
------------------------------------------------------------
2021-08-31 14:45:44,159 INFO  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Rest endpoint listening at 100.117.0.6:8081
2021-08-31 14:45:44,953 INFO  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector thoros-restserver-leader with lock identity 04658651-a7a9-47cb-aa51-2024343607fe.
2021-08-31 14:45:47,791 INFO  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 04658651-a7a9-47cb-aa51-2024343607fe for thoros-restserver-leader.
2021-08-31 14:45:48,217 INFO  org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Starting DefaultLeaderElectionService with KubernetesLeaderElectionDriver{configMapName='thoros-restserver-leader'}.
2021-08-31 14:45:48,246 INFO  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Web frontend listening at http://100.117.0.6:8081.
2021-08-31 14:45:48,261 INFO  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - http://100.117.0.6:8081 was granted leadership with leaderSessionID=64f52670-fe1b-452a-babe-c5fd94b38133
2021-08-31 14:45:48,618 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/rpc/resourcemanager_0 .
2021-08-31 14:45:48,689 INFO  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector thoros-dispatcher-leader with lock identity 04658651-a7a9-47cb-aa51-2024343607fe.
2021-08-31 14:45:48,785 INFO  org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Starting DefaultLeaderElectionService with KubernetesLeaderElectionDriver{configMapName='thoros-dispatcher-leader'}.
2021-08-31 14:45:48,793 INFO  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector thoros-resourcemanager-leader with lock identity 04658651-a7a9-47cb-aa51-2024343607fe.
2021-08-31 14:45:49,088 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='thoros-resourcemanager-leader'}.
2021-08-31 14:45:49,256 INFO  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 95fa9050-a8c9-44bb-bf5a-7322af43ea9d for thoros-dispatcher-leader.
2021-08-31 14:45:49,280 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='thoros-dispatcher-leader'}.
2021-08-31 14:45:49,376 INFO  org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Starting DefaultLeaderElectionService with KubernetesLeaderElectionDriver{configMapName='thoros-resourcemanager-leader'}.
2021-08-31 14:45:49,465 INFO  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 04658651-a7a9-47cb-aa51-2024343607fe for thoros-resourcemanager-leader.
2021-08-31 14:45:49,480 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - ResourceManager akka.tcp://[email protected]:6123/user/rpc/resourcemanager_0 was granted leadership with fencing token 967f8e81eab9fc2078d062ad11b1405c
2021-08-31 14:45:49,493 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
2021-08-31 14:45:54,049 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
2021-08-31 14:45:54,064 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
2021-08-31 14:45:54,113 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
2021-08-31 14:45:54,321 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
2021-08-31 14:45:55,789 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 100.117.0.4:6122-69262a (akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at ResourceManager
2021-08-31 14:45:55,890 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 100.117.0.4:6122-69262a (akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at ResourceManager
2021-08-31 14:45:59,988 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='thoros-00000000000000000000000000000000-jobmanager-leader'}.
2021-08-31 14:45:59,988 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:00,076 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:00,294 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:00,753 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:01,519 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,084 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,094 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,102 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,109 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,118 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,186 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 00000000000000000000000000000000 with allocation id c57a30bfc082abe9355c6ae4bb1b4c8a.
2021-08-31 14:46:03,279 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 00000000000000000000000000000000 with allocation id c07339fbf640585c403edba93be5b09c.
2021-08-31 14:59:18,486 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:6122] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2021-08-31 14:59:21,411 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /100.117.0.4:6122
2021-08-31 14:59:21,450 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:6122] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:6122]] Caused by: [java.net.ConnectException: Connection refused: /100.117.0.4:6122]
2021-08-31 14:59:31,447 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /100.117.0.4:6122
2021-08-31 14:59:31,453 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:6122] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:6122]] Caused by: [java.net.ConnectException: Connection refused: /100.117.0.4:6122]
2021-08-31 14:59:34,914 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 100.117.0.4:6122-9f2331 (akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at ResourceManager
2021-08-31 15:00:01,405 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - The heartbeat of TaskManager with id 100.117.0.4:6122-69262a timed out.
2021-08-31 15:00:01,406 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Closing TaskExecutor connection 100.117.0.4:6122-69262a because: The heartbeat of TaskManager with id 100.117.0.4:6122-69262a  timed out.
2021-08-31 15:00:02,793 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 00000000000000000000000000000000 with allocation id ec3e0def8314dea150bb281818828b55.
2021-08-31 15:00:02,796 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 00000000000000000000000000000000 with allocation id 2b8d9ba4a7c7895dbb0dba4f16006441.

Excerpt of LOGS after killing (what I think is) the leader; a new JobManager ("JobManager 2") pod is created to replace the killed one
------------------------------------------------------------
2021-08-31 15:00:02,778 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Recovering checkpoints from KubernetesStateHandleStore{configMapName='thoros-00000000000000000000000000000000-jobmanager-leader'}.
2021-08-31 15:00:02,784 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Found 0 checkpoints in KubernetesStateHandleStore{configMapName='thoros-00000000000000000000000000000000-jobmanager-leader'}.
2021-08-31 15:00:02,784 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - All 0 checkpoints found are already downloaded.
2021-08-31 15:00:02,784 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  [] - No checkpoint found during restore.


--
*Kind regards*
/Jonas Eyob/

