We started to see the same errors after upgrading to Flink 1.6.0 from 1.4.2. We have one JM and 5 TMs on Kubernetes, with the JM running in HA mode. The TaskManagers sometimes lose their connection to the JM with the following error, like the one you are seeing:
2018-09-19 12:36:40,687 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@flink-jobmanager:50002/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@flink-jobmanager:50002/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify".

Once a TM starts logging "Could not resolve ResourceManager", it never recovers on its own; the error persists until I restart the TM pod.

*Here is the content of our flink-conf.yaml:*

blob.server.port: 6124
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 4096
jobmanager.web.history: 20
jobmanager.archive.fs.dir: s3://our_path
taskmanager.rpc.port: 6121
taskmanager.heap.mb: 16384
taskmanager.numberOfTaskSlots: 10
taskmanager.log.path: /opt/flink/log/output.log
web.log.path: /opt/flink/log/output.log
state.checkpoints.num-retained: 3
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
high-availability: zookeeper
high-availability.jobmanager.port: 50002
high-availability.zookeeper.quorum: zookeeper_instance_list
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: profileservice
high-availability.storageDir: s3://our_path

Any help will be greatly appreciated!

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
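For context on the Kubernetes side: since high-availability.jobmanager.port is pinned to 50002, the JobManager Service has to expose that port to the TaskManager pods in addition to the usual RPC and blob ports. A minimal sketch of such a Service (names, labels, and port names here are illustrative, not our actual manifest):

```yaml
# Hypothetical JobManager Service sketch for a Flink-on-Kubernetes setup.
# The fixed HA RPC port (high-availability.jobmanager.port: 50002) must be
# reachable from the TM pods, alongside jobmanager.rpc.port and
# blob.server.port from flink-conf.yaml above.
apiVersion: v1
kind: Service
metadata:
  name: flink-jobmanager
spec:
  selector:
    app: flink          # illustrative label; match your JM deployment
    component: jobmanager
  ports:
    - name: rpc
      port: 6123        # jobmanager.rpc.port
    - name: blob
      port: 6124        # blob.server.port
    - name: ha-rpc
      port: 50002       # high-availability.jobmanager.port
```

In HA mode the TMs look up the leader address in ZooKeeper rather than using jobmanager.rpc.address directly, so the address in the log (flink-jobmanager:50002) is what was published to ZooKeeper and must stay resolvable and reachable from every TM pod.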