Re: High availability - leader election not working?

Chesnay Schepler Tue, 07 Sep 2021 01:11:55 -0700

> How do I identify the leader job manager? (see the third point below on approach to date)

Either through the config maps, or the logs. Your approach for the config maps _should_ be correct. There should be an "address" entry in each ConfigMap pointing to the current leader.
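
As a rough illustration of the ConfigMap check, here is a minimal Python sketch. It assumes you have already dumped one of the HA ConfigMaps as JSON (for example with `kubectl get configmap thoros-restserver-leader -o json`) and parsed it into a dict. The function name `leader_info` is made up for this example, and the annotation key below is an assumption about how the Kubernetes leader-election lock records its holder; check it against what your ConfigMaps actually contain.

```python
import json

# Assumption: annotation key under which the leader-election lock stores its
# holder. Verify against your own ConfigMaps before relying on it.
LEADER_ANNOTATION = "control-plane.alpha.kubernetes.io/leader"


def leader_info(configmap):
    """Summarize who leads according to one HA ConfigMap (parsed JSON dict)."""
    metadata = configmap.get("metadata") or {}
    data = configmap.get("data") or {}
    annotations = metadata.get("annotations") or {}

    # The lock annotation is itself a JSON string; it may be absent.
    raw_lock = annotations.get(LEADER_ANNOTATION)
    holder = json.loads(raw_lock).get("holderIdentity") if raw_lock else None

    return {
        "name": metadata.get("name"),
        # "address" is the entry that should point to the current leader.
        "address": data.get("address"),
        "holder": holder,
    }
```

If "address" comes back as None for a ConfigMap, that ConfigMap currently has no published leader address, which matches the symptom described in the third question.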

> How is the data stored at high-availability.storageDir used when the Leader is killed?

The storage dir contains the job graphs that were submitted by the client, such that on failover the new JM knows which jobs it should start up again.


> How would I best test if the HA setup is working?

Killing the leader is the best way to test it. While it is of course a rather destructive means of doing so, it also tests the very case you are interested in.
A weaker test would be submitting a job to the non-leading JM, which should forward it to the leader.
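
To check from the logs whether leadership actually moved after the kill, something like the following Python sketch could help. It only looks for the "New leader elected <lock identity> for <ConfigMap name>." messages that appear in the logs quoted below; the function names `leader_history` and `failed_over` are made up for this example. It is only meaningful on logs from a JobManager that stayed alive across the kill, since only that process sees both elections.

```python
import re
from collections import defaultdict

# Matches e.g. "... New leader elected <identity> for <configmap-name>."
# \s* after "for" tolerates logs where the space was lost in line wrapping.
ELECTED = re.compile(r"New leader elected (\S+) for\s*(\S+?)\.?\s*$")


def leader_history(log_lines):
    """Map each ConfigMap name to the sequence of lock identities elected."""
    history = defaultdict(list)
    for line in log_lines:
        match = ELECTED.search(line)
        if not match:
            continue
        identity, configmap = match.groups()
        # Record only changes, not repeated announcements of the same leader.
        if not history[configmap] or history[configmap][-1] != identity:
            history[configmap].append(identity)
    return dict(history)


def failed_over(history, configmap):
    """True if more than one leader was observed for the given ConfigMap."""
    return len(history.get(configmap, [])) > 1
```

Feed it the lines of the surviving JobManager's log (for example from `kubectl logs PODNAME`); if `failed_over` stays False for the dispatcher ConfigMap after you killed a pod, you most likely killed the standby rather than the leader.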


On 01/09/2021 19:53, jonas eyob wrote:

Hey all,
I have a 2 Job Manager, 1 Task Manager (2 slots) setup. Wanted to simply try to see if the leader election would work correctly.
We are using:
* Standalone Application Cluster setup on Kubernetes, and have followed the example configurations provided in the documentation for HA.
* Using the org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
* Using an s3 bucket for high-availability.storageDir (with the presto plugin).
* Beam (with flink executor) and are consuming events from a Kinesis stream.
When starting up, I see a lot of activity in both JobManagers invoking the election service, and eventually, the task manager is able to connect to the leader. Running also verifies that the taskmanager is consuming from the stream.
Now there are a few questions:
- How do I identify the leader job manager? (see the third point below on approach to date)
- How is the data stored at high-availability.storageDir used when the Leader is killed? From what I see in the S3 bucket we have a file called "submittedJobGraph716ad4dcd04e" - I understood the new leader would restore from this?
- Looking at the configmaps created as part of the HA: dispatch-leader, restserver-leader, ..., I understand it as the JobManager leader-address should be present in these? But doesn't look like it?
- How would I best test if the HA setup is working? I have so far tried killing one of the JobManagers that I thought was the leader using (kubectl exec PODNAME -- /bin/sh -c "kill 1") -- is there a better way?
LOGS from Job Manager 1
------------------------------------------------------------
2021-08-31 14:45:44,159 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Rest endpoint listening at 100.117.0.6:8081
2021-08-31 14:45:44,953 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector thoros-restserver-leader with lock identity 04658651-a7a9-47cb-aa51-2024343607fe.
2021-08-31 14:45:47,791 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 04658651-a7a9-47cb-aa51-2024343607fe for thoros-restserver-leader.
2021-08-31 14:45:48,217 INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Starting DefaultLeaderElectionService with KubernetesLeaderElectionDriver{configMapName='thoros-restserver-leader'}.
2021-08-31 14:45:48,246 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Web frontend listening at http://100.117.0.6:8081.
2021-08-31 14:45:48,261 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - http://100.117.0.6:8081 was granted leadership with leaderSessionID=64f52670-fe1b-452a-babe-c5fd94b38133
2021-08-31 14:45:48,618 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/rpc/resourcemanager_0 .
2021-08-31 14:45:48,689 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector thoros-dispatcher-leader with lock identity 04658651-a7a9-47cb-aa51-2024343607fe.
2021-08-31 14:45:48,785 INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Starting DefaultLeaderElectionService with KubernetesLeaderElectionDriver{configMapName='thoros-dispatcher-leader'}.
2021-08-31 14:45:48,793 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - Create KubernetesLeaderElector thoros-resourcemanager-leader with lock identity 04658651-a7a9-47cb-aa51-2024343607fe.
2021-08-31 14:45:49,088 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='thoros-resourcemanager-leader'}.
2021-08-31 14:45:49,256 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 95fa9050-a8c9-44bb-bf5a-7322af43ea9d for thoros-dispatcher-leader.
2021-08-31 14:45:49,280 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='thoros-dispatcher-leader'}.
2021-08-31 14:45:49,376 INFO org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - Starting DefaultLeaderElectionService with KubernetesLeaderElectionDriver{configMapName='thoros-resourcemanager-leader'}.
2021-08-31 14:45:49,465 INFO org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector [] - New leader elected 04658651-a7a9-47cb-aa51-2024343607fe for thoros-resourcemanager-leader.
2021-08-31 14:45:49,480 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - ResourceManager akka.tcp://[email protected]:6123/user/rpc/resourcemanager_0 was granted leadership with fencing token 967f8e81eab9fc2078d062ad11b1405c
2021-08-31 14:45:49,493 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
2021-08-31 14:45:54,049 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
2021-08-31 14:45:54,064 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
2021-08-31 14:45:54,113 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
2021-08-31 14:45:54,321 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
2021-08-31 14:45:55,789 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 100.117.0.4:6122-69262a (akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at ResourceManager
2021-08-31 14:45:55,890 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 100.117.0.4:6122-69262a (akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at ResourceManager
2021-08-31 14:45:59,988 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='thoros-00000000000000000000000000000000-jobmanager-leader'}.
2021-08-31 14:45:59,988 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:00,076 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:00,294 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:00,753 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:01,519 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,084 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,094 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,102 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,109 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,118 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager [email protected]://[email protected]:6123/user/rpc/jobmanager_2 for job 00000000000000000000000000000000.
2021-08-31 14:46:03,186 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 00000000000000000000000000000000 with allocation id c57a30bfc082abe9355c6ae4bb1b4c8a.
2021-08-31 14:46:03,279 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 00000000000000000000000000000000 with allocation id c07339fbf640585c403edba93be5b09c.
2021-08-31 14:59:18,486 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:6122] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2021-08-31 14:59:21,411 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /100.117.0.4:6122
2021-08-31 14:59:21,450 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:6122] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:6122]] Caused by: [java.net.ConnectException: Connection refused: /100.117.0.4:6122]
2021-08-31 14:59:31,447 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /100.117.0.4:6122
2021-08-31 14:59:31,453 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:6122] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:6122]] Caused by: [java.net.ConnectException: Connection refused: /100.117.0.4:6122]
2021-08-31 14:59:34,914 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID 100.117.0.4:6122-9f2331 (akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at ResourceManager
2021-08-31 15:00:01,405 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - The heartbeat of TaskManager with id 100.117.0.4:6122-69262a timed out.
2021-08-31 15:00:01,406 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Closing TaskExecutor connection 100.117.0.4:6122-69262a because: The heartbeat of TaskManager with id 100.117.0.4:6122-69262a timed out.
2021-08-31 15:00:02,793 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 00000000000000000000000000000000 with allocation id ec3e0def8314dea150bb281818828b55.
2021-08-31 15:00:02,796 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Request slot with profile ResourceProfile{UNKNOWN} for job 00000000000000000000000000000000 with allocation id 2b8d9ba4a7c7895dbb0dba4f16006441.
Excerpt of LOGS after (what I think is killing the leader) a new JobManager ("JobManager 2") pod is created to replace the killed one
------------------------------------------------------------
2021-08-31 15:00:02,778 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Recovering checkpoints from KubernetesStateHandleStore{configMapName='thoros-00000000000000000000000000000000-jobmanager-leader'}.
2021-08-31 15:00:02,784 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Found 0 checkpoints in KubernetesStateHandleStore{configMapName='thoros-00000000000000000000000000000000-jobmanager-leader'}.
2021-08-31 15:00:02,784 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - All 0 checkpoints found are already downloaded.
2021-08-31 15:00:02,784 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No checkpoint found during restore.
--
*Kind regards*
/Jonas Eyob/
