Failure detection and Heartbeats

Morgan Geldenhuys Tue, 10 Mar 2020 06:54:54 -0700

Hi community,

I am interested in knowing more about the failure detection mechanism used by Flink. Unfortunately, information is a little thin on the ground, and I was hoping someone could shed a little light on the topic.

Looking at the documentation (https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html), there are these two configuration options:



Key                 Default  Type  Description
heartbeat.interval  10000    Long  Time interval for requesting heartbeat from sender side.
heartbeat.timeout   50000    Long  Timeout for requesting and receiving heartbeat for both sender and receiver sides.

This would indicate Flink uses a heartbeat mechanism to ascertain the liveness of TaskManagers. From this the following assumptions are made:

- The JobManager is responsible for broadcasting heartbeat requests to all TaskManagers and awaits responses.
- If a response is not forthcoming from any particular node within the heartbeat timeout period, e.g. 50 seconds by default, then that node is timed out and assumed to have failed.
- The heartbeat interval indicates how often the heartbeat request broadcast is scheduled.
- Having the heartbeat interval shorter than the heartbeat timeout would mean that multiple requests can be underway at the same time.
- Therefore, the TaskManager would need to fail to respond to 4 requests (assuming normal response times are lower than 10 seconds) before being timed out after 50 seconds.


Therefore, if a failure were to occur (considering the default settings):

- In the best case the JobManager would detect the failure in the shortest time, i.e. 50 seconds +- (node fails just before receiving the next heartbeat request)
- In the worst case the JobManager would detect the failure in the longest time, i.e. 60 seconds +- (node fails just after sending the last heartbeat response)
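
To make the arithmetic behind these two bounds explicit, here is a minimal Python sketch of the model assumed in this post: the timeout is counted from the first request the failed node does not answer. The function name `detection_window` is purely illustrative, and this is only my assumed model, not a description of what Flink actually does internally.

```python
from typing import Tuple


def detection_window(interval_ms: int = 10_000,
                     timeout_ms: int = 50_000) -> Tuple[int, int]:
    """Return (best, worst) failure detection delay in milliseconds.

    Assumed model (from this post, not verified against Flink's code):
    the timeout starts counting at the first unanswered heartbeat request.
    """
    # Best case: the node dies just before the next request is sent,
    # so the wait is roughly the timeout alone.
    best = timeout_ms
    # Worst case: the node dies just after answering, so a full interval
    # passes before the first unanswered request, then the timeout.
    worst = interval_ms + timeout_ms
    return best, worst


best, worst = detection_window()
print(f"best ~{best / 1000:.0f} s, worst ~{worst / 1000:.0f} s")
```

With the default values this gives the 50 and 60 second figures quoted above; whether the timeout is really measured this way is exactly what I am asking.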


Is this correct?

For JobManagers in HA mode, this is left to ZooKeeper timeouts, which then initiate a round of elections, and the new leader picks up from the previous checkpoint.


Thank you in advance.

Regards,
M.
