I'm running Flink 1.7.2 in a Docker swarm. Intermittently, new task managers 
will fail to resolve their own host names when starting up. In the log I see 
"no hostname could be resolved" messages coming from TaskManagerLocation. The 
web UI on the jobmanager shows these taskmanagers as associated/connected with 
the jobmanager, but their akka paths show their IP rather than the container 
name that the 'good' taskmanagers show. When a new job tries to use one of the 
taskmanagers listed by IP, it gets 'failed to connect' errors and eventually 
fails. The affected taskmanagers still send regular heartbeats to the 
jobmanager, though, so the jobmanager keeps trying to assign work to them. 
Does anyone know what's going on here?
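
To narrow this down, it may help to check from inside an affected container 
whether a reverse lookup of the container's own IP returns a name at all. The 
warning appears to be logged when that lookup yields nothing but the IP 
itself. Below is a minimal diagnostic sketch using only the Python standard 
library. It is not Flink code; the helper names reverse_lookup and 
is_unresolved are made up for illustration, and it assumes Python 3 is 
available in the container.

```python
import socket


def reverse_lookup(ip):
    """Return the name reverse DNS gives for ip, or None if the lookup fails."""
    try:
        name, _aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:  # socket.herror and socket.gaierror are subclasses
        return None
    return name


def is_unresolved(ip, name):
    """True when the lookup produced nothing beyond the IP itself."""
    return name is None or name == ip


def main():
    # Hypothetical check: resolve our own hostname forward, then backward.
    hostname = socket.gethostname()
    try:
        ip = socket.gethostbyname(hostname)
    except OSError:
        print("forward lookup of", hostname, "failed")
        return
    name = reverse_lookup(ip)
    status = "UNRESOLVED" if is_unresolved(ip, name) else "ok"
    print(hostname, ip, name, status)


if __name__ == "__main__":
    main()
```

Running this on a 'good' and a 'bad' taskmanager shortly after startup, and 
again a little later, would show whether the name is missing only briefly 
while the container starts or stays missing.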
