We started to see the same errors after upgrading from Flink 1.4.2 to 1.6.0. We
have one JM and 5 TMs on Kubernetes, and the JM is running in HA mode. The
TaskManagers sometimes lose their connection to the JM and log the same error
you have:

2018-09-19 12:36:40,687 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not
resolve ResourceManager address
akka.tcp://flink@flink-jobmanager:50002/user/resourcemanager, retrying in
10000 ms: Ask timed out on
[ActorSelection[Anchor(akka.tcp://flink@flink-jobmanager:50002/),
Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of
type "akka.actor.Identify".

Once a TM starts logging "Could not resolve ResourceManager", it never
recovers on its own; it keeps failing until I restart the TM pod.

Here is the content of our flink-conf.yaml:
blob.server.port: 6124
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 4096
jobmanager.web.history: 20
jobmanager.archive.fs.dir: s3://our_path
taskmanager.rpc.port: 6121
taskmanager.heap.mb: 16384
taskmanager.numberOfTaskSlots: 10
taskmanager.log.path: /opt/flink/log/output.log
web.log.path: /opt/flink/log/output.log
state.checkpoints.num-retained: 3
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter

high-availability: zookeeper
high-availability.jobmanager.port: 50002
high-availability.zookeeper.quorum: zookeeper_instance_list
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: profileservice
high-availability.storageDir: s3://our_path

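For context, since high-availability.jobmanager.port is pinned to 50002, the
Kubernetes Service in front of the JobManager also has to expose that port in
addition to the usual RPC and blob ports, otherwise TMs cannot reach the
resource manager at akka.tcp://flink@flink-jobmanager:50002. A minimal sketch
of such a Service (names, labels, and selectors here are illustrative, not our
actual manifest):

```yaml
# Illustrative sketch only -- metadata.name matches jobmanager.rpc.address,
# the selector labels are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: flink-jobmanager
spec:
  selector:
    app: flink
    component: jobmanager
  ports:
    - name: rpc
      port: 6123      # jobmanager.rpc.port
    - name: blob
      port: 6124      # blob.server.port
    - name: ha-rpc
      port: 50002     # must match high-availability.jobmanager.port
```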
Any help will be greatly appreciated!



