[ https://issues.apache.org/jira/browse/FLINK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-21685:
-----------------------------------
    Labels: pull-request-available  (was: )

> Flink JobManager failed to restart from checkpoint in kubernetes HA setup
> -------------------------------------------------------------------------
>
>                 Key: FLINK-21685
>                 URL: https://issues.apache.org/jira/browse/FLINK-21685
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.12.1, 1.12.2
>            Reporter: Peng Zhang
>            Assignee: Yang Wang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 01-role.yaml, 02-role-binding.yaml, 03-config.yaml, 06-jobmanager-deployment.yaml, 08-taskmanager-deployment.yaml, flink-ha.log, jstack.jm.1, scalyr-logs.txt.zip
>
>
> We use a Flink K8S session cluster in HA mode (1 JobManager and 4 TaskManagers). When jobs were running in Flink and the JobManager restarted, the JobManager failed to recover the job from its checkpoint:
> {code}
> 2021-03-08 13:16:42,962 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
> 2021-03-08 13:16:42,962 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
> 2021-03-08 13:16:42,962 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 1.
> 2021-03-08 13:16:43,014 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for 9a534b2e309b24f78866b65d94082ead located at s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
>
> 2021-03-08 13:16:43,023 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master state to restore
> 2021-03-08 13:16:43,024 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2 for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead).
> 2021-03-08 13:16:43,046 INFO org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead) was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74 at akka.tcp://flink@10.2.179.12:6123/user/rpc/jobmanager_2.
> 2021-03-08 13:16:43,060 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
> 2021-03-08 13:16:43,060 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@10.2.174.188:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.2.174.188:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
> {code}
> The log and our configuration are attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)