Hi community,
I am interested in knowing more about the failure detection mechanism
used by Flink, unfortunately information is a little thin on the ground
and I was hoping someone could shed a little light on the topic.
Looking at the documentation
(https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html),
there are these two configuration options:
heartbeat.interval
10000 Long Time interval for requesting heartbeat from sender side.
heartbeat.timeout
50000 Long Timeout for requesting and receiving heartbeat for both
sender and receiver sides.
This would indicate Flink uses a heartbeat mechanism to ascertain the
liveness of TaskManagers. From this the following assumptions are made:
The JobManager is responsible for broadcasting a heartbeat requests to
all TaskManagers and awaits responses.
If a response is not forthcoming from any particular node within the
heartbeat timeout period, e.g. 50 seconds by default, then that node is
timed out and assumed to have failed.
The heartbeat interval indicated how often the heartbeat request
broadcast is scheduled.
Having the heartbeat interval shorter than the heartbeat timeout would
mean that multiple requests can be underway at the same time.
Therefore, the TaskManager would need to fail to respond to 4 requests
(assuming normal response times are lower than 10 seconds) before being
timed out after 50 seconds.
So therefore if a failure were to occur (considering the default settings):
- In the best case the JobManager would detect the failure in the
shortest time, i.e. 50 seconds +- (node fails just before receiving the
next heartbeat request)
- In the worst case the JobManager would detect the failure in the
longest time, i.e. 60 seconds +- (node fails just after sending the last
heartbeat response)
Is this correct?
For JobManagers in HA mode, this is left to ZooKeeper timeouts which
then initiates a round of elections and the new leader picks up from the
previous checkpoint.
Thank you in advance.
Regards,
M.