[ https://issues.apache.org/jira/browse/FLINK-7340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204715#comment-16204715 ]
Stephan Ewen edited comment on FLINK-7340 at 10/14/17 4:27 PM:
---------------------------------------------------------------

Thanks, that is a good pointer!

[~till.rohrmann] Can we take this into account in the Akka / HA address management?

was (Author: stephanewen):
Thanks, that is a good pointer!

[~till.rohrmann] Can we take this into account in the Akka / HA address management

> Taskmanager hung after temporary DNS outage
> -------------------------------------------
>
>                 Key: FLINK-7340
>                 URL: https://issues.apache.org/jira/browse/FLINK-7340
>             Project: Flink
>          Issue Type: Bug
>          Components: Core, Distributed Coordination
>    Affects Versions: 1.3.1
>         Environment: Non-HA Flink running in Kubernetes.
>            Reporter: Joshua Griffith
>
> After a Kubernetes node failure, several TaskManagers and the DNS system were
> automatically restarted. One TaskManager was unable to connect to the
> JobManager and continually logged the following errors:
> {quote}
> 2017-08-01 18:58:06.707 [flink-akka.actor.default-dispatcher-823] INFO
> org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at
> JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 595,
> timeout: 30000 milliseconds)
> 2017-08-01 18:58:06.713 [flink-akka.actor.default-dispatcher-834] INFO
> Remoting flink-akka.remote.default-remote-dispatcher-240 - Quarantined
> address [akka.tcp://flink@jobmanager:6123] is still unreachable or has not
> been restarted. Keeping it quarantined.
> {quote}
> After exec'ing into the container, I was able to {{telnet jobmanager 6123}}
> successfully and {{dig jobmanager}} showed the correct IP in DNS. I suspect
> that the TaskManager cached a bad IP address for the JobManager when the DNS
> system was restarting and it used that cached address rather than respecting
> the 30s TTL and getting a new one for the next request. It may be a good idea
> for the TaskManager to explicitly perform a DNS lookup after JobManager
> connection failures.
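
The reporter's suggestion (look the JobManager host up again after a connection failure instead of reusing an address resolved earlier) could be sketched on the JVM roughly as below. This is only an illustration under stated assumptions, not Flink's or Akka's actual code: the class `FreshDnsLookupSketch` and its methods are hypothetical names, and nothing here addresses Akka's quarantine handling. The two security properties are the standard JVM settings that bound how long successful and failed lookups stay cached; they should be set early, before the first lookup in the process.

```java
import java.net.InetSocketAddress;
import java.security.Security;

// Hypothetical helper class, not part of Flink or Akka.
final class FreshDnsLookupSketch {

    /**
     * Bounds the JVM-wide DNS cache. The JVM reads these security properties
     * when it initializes its address cache, so call this before any lookup.
     */
    static void limitJvmDnsCache(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    /**
     * Builds a new socket address for every connection attempt. The constructor
     * performs a lookup (subject to the JVM cache TTLs above) rather than
     * reusing an InetAddress object that was resolved during the outage.
     * If the lookup fails, the result is flagged via isUnresolved().
     */
    static InetSocketAddress resolveNow(String host, int port) {
        return new InetSocketAddress(host, port);
    }

    public static void main(String[] args) {
        // 30 s mirrors the DNS TTL mentioned in the report; 0 disables negative caching.
        limitJvmDnsCache(30, 0);
        String host = args.length > 0 ? args[0] : "localhost";
        InetSocketAddress address = resolveNow(host, 6123);
        System.out.println(address.isUnresolved()
                ? "lookup failed for " + host
                : "resolved " + address);
    }
}
```

A caller would invoke `resolveNow("jobmanager", 6123)` before each registration attempt and retry later when `isUnresolved()` returns true.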