Niklas Semmler created FLINK-26503:
--------------------------------------
Summary: Multi-network deployments may lead to hard-to-debug issues
Key: FLINK-26503
URL: https://issues.apache.org/jira/browse/FLINK-26503
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Niklas Semmler
The way the TaskManager defines its receiving network address may create
hard-to-debug issues for deployment setups where a TaskManager can reach a
JobManager over more than one interface. Additionally, the following conditions
have to be met:
* The JobManager is unreachable for some time on the TaskManager's start-up
* The receiving address {{taskmanager.host}} is not specified while the
{{taskmanager.bind-host}} is set to a concrete value (and not {{0.0.0.0}}).
*Background*
A TaskManager has two settings to configure its network:
{{taskmanager.bind-host}} and {{taskmanager.host}}.
* {{taskmanager.bind-host}} decides which interfaces the TaskManager binds to on
start-up. When not set, it defaults to {{0.0.0.0}} (the most permissive
setting). With FLINK-24474, this setting is now set to {{localhost}} via the
configuration file. This line is disabled for Docker setups.
* {{taskmanager.host}} specifies what the TaskManager announces as its receiving
address (i.e., where it wants to be reached by the JobManager). When not set,
the TaskManager cycles through the available interfaces (a slight
simplification) and attempts to connect to the JobManager via each interface.
It chooses the address of the interface that allows a successful connection. If
no connection can be made before a timeout, it chooses the address of the
hostname.
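The selection behaviour described for an unset {{taskmanager.host}} can be sketched roughly as follows. This is a simplified illustration in Python, not Flink's actual implementation: the function names, the retry loop, and the timeout values are all assumptions made for the example.

```python
import socket
import time
from typing import Callable, Sequence, Tuple


def can_connect_from(local_ip: str, target: Tuple[str, int], timeout_s: float) -> bool:
    """Return True if a TCP connection to target succeeds when sent from local_ip."""
    try:
        with socket.create_connection(
            target, timeout=timeout_s, source_address=(local_ip, 0)
        ):
            return True
    except OSError:
        return False


def choose_receiving_address(
    local_ips: Sequence[str],
    target: Tuple[str, int],
    hostname_ip: str,
    deadline_s: float = 1.0,
    probe_timeout_s: float = 0.2,
    probe: Callable[[str, Tuple[str, int], float], bool] = can_connect_from,
) -> str:
    """Pick the first local address that reaches the JobManager.

    Hypothetical helper: falls back to the address of the hostname if no
    interface connects before the deadline, mirroring the description above.
    """
    start = time.monotonic()
    while True:
        for ip in local_ips:
            if probe(ip, target, probe_timeout_s):
                return ip
        if time.monotonic() - start >= deadline_s:
            # No interface worked in time: fall back to the hostname's address,
            # regardless of which address the process is actually bound to.
            return hostname_ip
        time.sleep(0.05)
```

The fallback branch is where the problem in this issue originates: the returned address is not checked against the address configured in {{taskmanager.bind-host}}.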
In network scenarios with multiple interfaces connecting TaskManager to the
JobManager, a user could specify an address & interface with
{{taskmanager.bind-host}} that is different to the address & network defined by
the {{taskmanager.host}}.
*Example*
A straightforward way to create such a scenario is to set
{{taskmanager.bind-host}} to {{localhost}}. In this situation (and under the
conditions mentioned above), the TaskManager will fall back to the external IP
address of the hostname while binding to localhost. This means that the
JobManager will be able to receive packets from the TaskManager, but will not
be able to respond.
*Proposed solution*
Can we help users debug these scenarios by adding log messages? Could we note
that the IP address of the TaskManager is not on the JobManager's network (as
defined by the interface the JobManager binds to) and give this as a hint?
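One possible shape for such a hint, sketched in Python for illustration only. The function names, the logger name, the message wording, and the assumption that the JobManager's interface is known in CIDR form are all hypothetical and not taken from Flink.

```python
import ipaddress
import logging

log = logging.getLogger("taskmanager-address-hint")  # hypothetical logger name


def on_same_network(taskmanager_ip: str, jobmanager_interface_cidr: str) -> bool:
    """Check whether the announced TaskManager address lies in the JobManager's network.

    strict=False lets the CIDR carry host bits, e.g. "10.0.0.1/24".
    """
    network = ipaddress.ip_network(jobmanager_interface_cidr, strict=False)
    return ipaddress.ip_address(taskmanager_ip) in network


def hint_if_unreachable(taskmanager_ip: str, jobmanager_interface_cidr: str) -> bool:
    """Log a hint and return True when the addresses are on different networks."""
    if on_same_network(taskmanager_ip, jobmanager_interface_cidr):
        return False
    log.warning(
        "TaskManager announced address %s, which is not on the JobManager's "
        "network %s. Check taskmanager.host and taskmanager.bind-host.",
        taskmanager_ip,
        jobmanager_interface_cidr,
    )
    return True
```

A check like this would only be a hint: routed networks can make an address outside the JobManager's subnet perfectly reachable, so a warning seems more appropriate than a hard failure.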