Re: High availability - leader election not working?

Arvid Heise Mon, 06 Sep 2021 09:16:43 -0700

Hi Jonas,

I immediately see some network issues in the logs:


2021-08-31 14:45:54,113 WARN  akka.remote.ReliableDeliverySupervisor
                [] - Association with remote system [akka.tcp://
[email protected]:6123] has failed, address is now gated for [50] ms.
Reason: [Association failed with [akka.tcp://[email protected]:6123]]
Caused by: [java.net.NoRouteToHostException: No route to host]
...
2021-08-31 14:59:21,411 WARN  akka.remote.transport.netty.NettyTransport
                [] - Remote connection to [null] failed with
java.net.ConnectException: Connection refused: /100.117.0.4:6122

Can you make sure that each task manager can reach the job manager and that
each task manager can reach all other task managers?
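
The reachability check can be sketched as a small TCP probe. This is only a
minimal sketch, not an official Flink tool: the helper names can_connect and
check_endpoints are made up for illustration, and it assumes Python is
available inside the pod you run it from (that may not be the case in your
image). Ports 6122 and 6123 are the ones that appear in the logs above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", "Connection refused", timeouts and
        # DNS resolution failures (all are subclasses of OSError).
        return False


def check_endpoints(endpoints):
    """Probe every (host, port) pair and return a {"host:port": bool} map."""
    return {f"{host}:{port}": can_connect(host, port) for host, port in endpoints}
```

Run from inside a task manager pod, a call such as
check_endpoints([("100.117.0.4", 6122)]) probes the task manager address seen
in the logs; add the job manager address and port 6123 in the same way. A
False result only says that the TCP connection failed. The cause (network
policy, wrong service address, pod not yet ready) still has to be determined
separately.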

On Wed, Sep 1, 2021 at 7:53 PM jonas eyob <[email protected]> wrote:

> Hey all,
>
> I have a setup with 2 Job Managers and 1 Task Manager (2 slots), and
> simply wanted to see whether leader election works correctly.
>
> We are using:
> * Standalone Application Cluster setup on Kubernetes, and have followed
> the example configurations provided in the documentation for HA.
> * Using the
> org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> * Using an S3 bucket for high-availability.storageDir (with the presto
> plugin).
> * Using Beam (with the Flink executor), consuming events from a Kinesis
> stream.
>
> When starting up, I see a lot of activity in both JobManagers invoking the
> election service, and eventually, the task manager is able to connect to
> the leader. Running the job also confirms that the task manager is
> consuming from the stream.
>
> Now there are a few questions:
> - How do I identify the leader job manager? (see the third point below on
> approach to date)
> - How is the data stored at high-availability.storageDir used when the
> Leader is killed? From what I see in the S3 bucket we have a file called
> "submittedJobGraph716ad4dcd04e" - I understood the new leader would restore
> from this?
> - Looking at the configmaps created as part of the HA setup
> (dispatcher-leader, restserver-leader, ...), I understand that the
> JobManager leader address should be present in these? But that doesn't
> seem to be the case?
> - How would I best test if the HA setup is working? I have so far tried
> killing one of the JobManagers that I thought was the leader using (kubectl
> exec PODNAME -- /bin/sh -c "kill 1") -- is there a better way?
>
> LOGS from Job Manager 1
> ------------------------------------------------------------
> 2021-08-31 14:45:44,159 INFO
>  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Rest
> endpoint listening at 100.117.0.6:8081
> 2021-08-31 14:45:44,953 INFO
>  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector
> [] - Create KubernetesLeaderElector thoros-restserver-leader with lock
> identity 04658651-a7a9-47cb-aa51-2024343607fe.
> 2021-08-31 14:45:47,791 INFO
>  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector
> [] - New leader elected 04658651-a7a9-47cb-aa51-2024343607fe for
> thoros-restserver-leader.
> 2021-08-31 14:45:48,217 INFO
>  org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] -
> Starting DefaultLeaderElectionService with
> KubernetesLeaderElectionDriver{configMapName='thoros-restserver-leader'}.
> 2021-08-31 14:45:48,246 INFO
>  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Web
> frontend listening at http://100.117.0.6:8081.
> 2021-08-31 14:45:48,261 INFO
>  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] -
> http://100.117.0.6:8081 was granted leadership with
> leaderSessionID=64f52670-fe1b-452a-babe-c5fd94b38133
> 2021-08-31 14:45:48,618 INFO
>  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting
> RPC endpoint for
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at
> akka://flink/user/rpc/resourcemanager_0 .
> 2021-08-31 14:45:48,689 INFO
>  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector
> [] - Create KubernetesLeaderElector thoros-dispatcher-leader with lock
> identity 04658651-a7a9-47cb-aa51-2024343607fe.
> 2021-08-31 14:45:48,785 INFO
>  org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] -
> Starting DefaultLeaderElectionService with
> KubernetesLeaderElectionDriver{configMapName='thoros-dispatcher-leader'}.
> 2021-08-31 14:45:48,793 INFO
>  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector
> [] - Create KubernetesLeaderElector thoros-resourcemanager-leader with lock
> identity 04658651-a7a9-47cb-aa51-2024343607fe.
> 2021-08-31 14:45:49,088 INFO
>  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService []
> - Starting DefaultLeaderRetrievalService with
> KubernetesLeaderRetrievalDriver{configMapName='thoros-resourcemanager-leader'}.
> 2021-08-31 14:45:49,256 INFO
>  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector
> [] - New leader elected 95fa9050-a8c9-44bb-bf5a-7322af43ea9d for
> thoros-dispatcher-leader.
> 2021-08-31 14:45:49,280 INFO
>  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService []
> - Starting DefaultLeaderRetrievalService with
> KubernetesLeaderRetrievalDriver{configMapName='thoros-dispatcher-leader'}.
> 2021-08-31 14:45:49,376 INFO
>  org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] -
> Starting DefaultLeaderElectionService with
> KubernetesLeaderElectionDriver{configMapName='thoros-resourcemanager-leader'}.
> 2021-08-31 14:45:49,465 INFO
>  org.apache.flink.kubernetes.kubeclient.resources.KubernetesLeaderElector
> [] - New leader elected 04658651-a7a9-47cb-aa51-2024343607fe for
> thoros-resourcemanager-leader.
> 2021-08-31 14:45:49,480 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> ResourceManager akka.tcp://
> [email protected]:6123/user/rpc/resourcemanager_0 was granted leadership
> with fencing token 967f8e81eab9fc2078d062ad11b1405c
> 2021-08-31 14:45:49,493 INFO
>  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] -
> Starting the SlotManager.
> 2021-08-31 14:45:54,049 WARN  akka.remote.transport.netty.NettyTransport
>                 [] - Remote connection to [null] failed with
> java.net.NoRouteToHostException: No route to host
> 2021-08-31 14:45:54,064 WARN  akka.remote.transport.netty.NettyTransport
>                 [] - Remote connection to [null] failed with
> java.net.NoRouteToHostException: No route to host
> 2021-08-31 14:45:54,113 WARN  akka.remote.ReliableDeliverySupervisor
>                 [] - Association with remote system [akka.tcp://
> [email protected]:6123] has failed, address is now gated for [50] ms.
> Reason: [Association failed with [akka.tcp://[email protected]:6123]]
> Caused by: [java.net.NoRouteToHostException: No route to host]
> 2021-08-31 14:45:54,321 WARN  akka.remote.ReliableDeliverySupervisor
>                 [] - Association with remote system [akka.tcp://
> [email protected]:6123] has failed, address is now gated for [50] ms.
> Reason: [Association failed with [akka.tcp://[email protected]:6123]]
> Caused by: [java.net.NoRouteToHostException: No route to host]
> 2021-08-31 14:45:55,789 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 100.117.0.4:6122-69262a
> (akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2021-08-31 14:45:55,890 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 100.117.0.4:6122-69262a
> (akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2021-08-31 14:45:59,988 INFO
>  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService []
> - Starting DefaultLeaderRetrievalService with
> KubernetesLeaderRetrievalDriver{configMapName='thoros-00000000000000000000000000000000-jobmanager-leader'}.
> 2021-08-31 14:45:59,988 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager [email protected]://
> [email protected]:6123/user/rpc/jobmanager_2 for job
> 00000000000000000000000000000000.
> 2021-08-31 14:46:00,076 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager [email protected]://
> [email protected]:6123/user/rpc/jobmanager_2 for job
> 00000000000000000000000000000000.
> 2021-08-31 14:46:00,294 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager [email protected]://
> [email protected]:6123/user/rpc/jobmanager_2 for job
> 00000000000000000000000000000000.
> 2021-08-31 14:46:00,753 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager [email protected]://
> [email protected]:6123/user/rpc/jobmanager_2 for job
> 00000000000000000000000000000000.
> 2021-08-31 14:46:01,519 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering job manager [email protected]://
> [email protected]:6123/user/rpc/jobmanager_2 for job
> 00000000000000000000000000000000.
> 2021-08-31 14:46:03,084 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager [email protected]://
> [email protected]:6123/user/rpc/jobmanager_2 for job
> 00000000000000000000000000000000.
> 2021-08-31 14:46:03,094 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager [email protected]://
> [email protected]:6123/user/rpc/jobmanager_2 for job
> 00000000000000000000000000000000.
> 2021-08-31 14:46:03,102 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager [email protected]://
> [email protected]:6123/user/rpc/jobmanager_2 for job
> 00000000000000000000000000000000.
> 2021-08-31 14:46:03,109 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager [email protected]://
> [email protected]:6123/user/rpc/jobmanager_2 for job
> 00000000000000000000000000000000.
> 2021-08-31 14:46:03,118 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registered job manager [email protected]://
> [email protected]:6123/user/rpc/jobmanager_2 for job
> 00000000000000000000000000000000.
> 2021-08-31 14:46:03,186 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 00000000000000000000000000000000 with allocation id
> c57a30bfc082abe9355c6ae4bb1b4c8a.
> 2021-08-31 14:46:03,279 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 00000000000000000000000000000000 with allocation id
> c07339fbf640585c403edba93be5b09c.
> 2021-08-31 14:59:18,486 WARN  akka.remote.ReliableDeliverySupervisor
>                 [] - Association with remote system [akka.tcp://
> [email protected]:6122] has failed, address is now gated for [50] ms.
> Reason: [Disassociated]
> 2021-08-31 14:59:21,411 WARN  akka.remote.transport.netty.NettyTransport
>                 [] - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: /100.117.0.4:6122
> 2021-08-31 14:59:21,450 WARN  akka.remote.ReliableDeliverySupervisor
>                 [] - Association with remote system [akka.tcp://
> [email protected]:6122] has failed, address is now gated for [50] ms.
> Reason: [Association failed with [akka.tcp://[email protected]:6122]]
> Caused by: [java.net.ConnectException: Connection refused: /
> 100.117.0.4:6122]
> 2021-08-31 14:59:31,447 WARN  akka.remote.transport.netty.NettyTransport
>                 [] - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: /100.117.0.4:6122
> 2021-08-31 14:59:31,453 WARN  akka.remote.ReliableDeliverySupervisor
>                 [] - Association with remote system [akka.tcp://
> [email protected]:6122] has failed, address is now gated for [50] ms.
> Reason: [Association failed with [akka.tcp://[email protected]:6122]]
> Caused by: [java.net.ConnectException: Connection refused: /
> 100.117.0.4:6122]
> 2021-08-31 14:59:34,914 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Registering TaskManager with ResourceID 100.117.0.4:6122-9f2331
> (akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at
> ResourceManager
> 2021-08-31 15:00:01,405 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> The heartbeat of TaskManager with id 100.117.0.4:6122-69262a timed out.
> 2021-08-31 15:00:01,406 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Closing TaskExecutor connection 100.117.0.4:6122-69262a because: The
> heartbeat of TaskManager with id 100.117.0.4:6122-69262a  timed out.
> 2021-08-31 15:00:02,793 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 00000000000000000000000000000000 with allocation id
> ec3e0def8314dea150bb281818828b55.
> 2021-08-31 15:00:02,796 INFO
>  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
> Request slot with profile ResourceProfile{UNKNOWN} for job
> 00000000000000000000000000000000 with allocation id
> 2b8d9ba4a7c7895dbb0dba4f16006441.
>
> Excerpt of LOGS after killing what I think is the leader; a new
> JobManager ("JobManager 2") pod is created to replace the killed one
> ------------------------------------------------------------
> 2021-08-31 15:00:02,778 INFO
>  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] -
> Recovering checkpoints from
> KubernetesStateHandleStore{configMapName='thoros-00000000000000000000000000000000-jobmanager-leader'}.
> 2021-08-31 15:00:02,784 INFO
>  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] -
> Found 0 checkpoints in
> KubernetesStateHandleStore{configMapName='thoros-00000000000000000000000000000000-jobmanager-leader'}.
> 2021-08-31 15:00:02,784 INFO
>  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] -
> All 0 checkpoints found are already downloaded.
> 2021-08-31 15:00:02,784 INFO
>  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No
> checkpoint found during restore.
>
>
> --
> *Med Vänliga Hälsningar*
> *Jonas Eyob*
>
