Hi,

During an offline discussion, Felix suggested lowering the topology connector's heartbeat frequency. Heartbeats are currently sent every 15 or 30 sec, which might seem like a lot, especially as they used to be far too chatty (that part is now fixed with SLING-3377).
The main reason for having a high heartbeat frequency is quicker failure detection, but it is a trade-off, as it also increases load.

I would like to get some opinions on the following proposal:

* introduce two different sets of heartbeats, one for the repository and one for the connectors
* the repository heartbeats would remain at the current frequency (suggested default: 30 sec interval, 60 sec timeout). The idea is that we want to detect crashes within a cluster rather quickly, more quickly than in the topology in general.
* the connector heartbeats would get a back-off behavior: initially the values are the same (30 sec/60 sec), but heartbeats are then sent less frequently over time, until the interval reaches a maximum (e.g. 5 min). This would have to be controlled by the receiving side, i.e. both sides of the connector have to agree on the same interval and timeout.

I've opened a Jira issue to track this, please comment there: https://issues.apache.org/jira/browse/SLING-3382

Thanks,
Cheers,
Stefan
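
For illustration, the connector back-off described above could be sketched roughly as follows. This is only a sketch under assumptions: the class name `ConnectorHeartbeatBackoff` is hypothetical and not part of Sling, doubling the interval is just one possible back-off curve, and keeping the timeout at twice the interval is an assumption extrapolated from the 30 sec/60 sec defaults. The receiving side would compute these values and announce them to the sender so that both ends agree.

```java
/**
 * Hypothetical helper (not an existing Sling class): computes the heartbeat
 * interval and timeout for a topology connector with a doubling back-off.
 */
final class ConnectorHeartbeatBackoff {

    // Initial interval from the proposal (30 sec).
    static final long INITIAL_INTERVAL_SEC = 30;

    // Upper bound from the proposal ("eg 5min").
    static final long MAX_INTERVAL_SEC = 300;

    // Assumption: timeout stays at 2x the interval, matching 30 sec/60 sec.
    static final long TIMEOUT_FACTOR = 2;

    private ConnectorHeartbeatBackoff() {
    }

    /**
     * Interval the receiving side would announce after the given number of
     * successfully received heartbeats. Doubles each time, capped at the max.
     */
    static long intervalSeconds(int successfulHeartbeats) {
        long interval = INITIAL_INTERVAL_SEC;
        for (int i = 0; i < successfulHeartbeats && interval < MAX_INTERVAL_SEC; i++) {
            interval = Math.min(interval * 2, MAX_INTERVAL_SEC);
        }
        return interval;
    }

    /** Timeout that both sides of the connector would use for that interval. */
    static long timeoutSeconds(int successfulHeartbeats) {
        return intervalSeconds(successfulHeartbeats) * TIMEOUT_FACTOR;
    }
}
```

With these assumed values the interval grows as 30, 60, 120, 240 and then stays at 300 seconds. After a missed heartbeat or a reconnect, the counter would presumably be reset to 0 so that failure detection is quick again.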