rondagostino commented on code in PR #13759: URL: https://github.com/apache/kafka/pull/13759#discussion_r1206028552
########## metadata/src/main/java/org/apache/kafka/controller/BrokerHeartbeatManager.java:
##########

@@ -223,6 +283,21 @@ public BrokerHeartbeatState next() {
        }
    }

+    /**
+     * The maximum number of timed out heartbeats to count.
+     */
+    static final int DEFAULT_TIMED_OUT_HEARTBEAT_COUNT_MAX = 1000;
+
+    /**
+     * The time period over which to track timed out heartbeats.
+     */
+    static final long DEFAULT_TIMED_OUT_HEARTBEAT_COUNT_WINDOW_NS = TimeUnit.MINUTES.toNanos(5);
+
+    /**
+     * The number of heartbeats to notice missing before we go into overload.
+     */
+    static final int DEFAULT_TIMED_OUT_HEARTBEAT_OVERLOAD_THRESHOLD = 3;
+

Review Comment:
Whatever we set this to (and currently we always take this default value) will determine how long it takes for a controller to leave the overloaded state, right? So if we see a series of missed heartbeats that puts us into the overloaded state, and then we don't see any more missed heartbeats, it would take this amount of time to leave that overloaded state?

Consider the following case. We are overloaded, and then some broker crashes. How long would we want it to take before we move leadership away for any partitions led by that suddenly-dead broker? Ideally the session timeout, of course, but that won't happen if we are overloaded -- will it take 5 minutes as currently coded? That seems like a long time.

I think the only thing we can say is that we should move leadership away as soon as we have a correct view of the cluster. The question then becomes: how do we know when we have good information? A fixed amount of time with no missed heartbeats seems too simple. We likely have a reasonable view of the cluster if we see enough heartbeats, without missing any, that every broker has probably had a chance to communicate with us. So it seems like the broker heartbeat interval should factor into it somehow. Let's say the broker heartbeat interval is 2 seconds.
Right now we consider a heartbeat missed if it is more than half the heartbeat interval (i.e. 1 second) old. Let's say there are 60 brokers in the cluster. If we see something on the order of 60 heartbeats in a row without missing any, then we probably have a reasonable view. So maybe the criterion isn't "no more than N missed heartbeats in some fixed time window." Maybe we define "not overloaded" as having seen some uninterrupted string of heartbeats without a missed one? In other words: every time we miss a heartbeat we go into the "overloaded" state and reset a counter of contiguous successfully processed heartbeats to 0. Whenever we see a heartbeat and process it in time, we increment that counter. We are no longer overloaded once the `contiguousHeartbeatsProcessedInTime` counter equals or exceeds the broker count.

I'm not sure I have it right, but I think this is a discussion worth pursuing.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
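The counter scheme proposed in the comment above could be sketched roughly as follows. This is a hypothetical illustration, not the actual `BrokerHeartbeatManager` API; the class and method names (`OverloadTracker`, `onHeartbeat`) are made up for this sketch:

```java
/**
 * Hypothetical sketch of the proposed overload-exit rule: entering overload on
 * any missed heartbeat, and leaving overload only after a contiguous run of
 * in-time heartbeats at least as long as the current broker count.
 */
final class OverloadTracker {
    private int contiguousHeartbeatsProcessedInTime = 0;
    private boolean overloaded = false;

    /**
     * Record one heartbeat.
     *
     * @param timedOut    true if the heartbeat sat in the queue for more than
     *                    half the heartbeat interval (i.e. it was "missed")
     * @param brokerCount the number of registered brokers in the cluster
     */
    void onHeartbeat(boolean timedOut, int brokerCount) {
        if (timedOut) {
            // Any missed heartbeat puts us into overload and resets the run.
            overloaded = true;
            contiguousHeartbeatsProcessedInTime = 0;
        } else {
            contiguousHeartbeatsProcessedInTime++;
            // Once every broker has likely had a chance to heartbeat without
            // us missing any, we probably have a good view of the cluster.
            if (contiguousHeartbeatsProcessedInTime >= brokerCount) {
                overloaded = false;
            }
        }
    }

    boolean isOverloaded() {
        return overloaded;
    }
}
```

With a 2-second heartbeat interval and 60 brokers, this would let the controller leave overload after roughly one heartbeat interval of clean processing, rather than a fixed 5-minute window.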