cmccabe opened a new pull request, #13759: URL: https://github.com/apache/kafka/pull/13759
When the active KRaft controller is overloaded, it cannot process broker heartbeat requests in time, and they time out instead. With the default configuration, this happens if the time needed to process a broker heartbeat stays above one second for a sustained period. This, in turn, can lead to brokers being improperly fenced while they are still alive.

This PR creates a new state, the overload state, which the active controller enters when it fails to process more than 3 heartbeats in a given 5 minute period. While in the overload state, no brokers are fenced due to missed heartbeats (because we don't know whether the heartbeats were missed because of load or for another reason). Entering the overload state increments the metadata error metric on the controller, which will alert the operator to the problem. We also log a message when we notice that we have left the overload state.

BrokerHeartbeatManager now stores the data about how many heartbeats have been missed. I added a builder to make it easier to add more constructor parameters. WindowedEventCounter is a new generic class which counts the number of times it has seen a specific event within a certain time period.

In addition to the changes above, QuorumController now increments the metadata fault metric if we get an unexpected exception while processing an event (in other words, an exception that doesn't map cleanly to a known return code). This should help catch NullPointerExceptions and similar, if any surface again.
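For readers who want a concrete picture of the windowed counting and the overload check described above, here is a minimal sketch. It is not the code from this PR: the class names WindowedEventCounterSketch and OverloadDetectorSketch, their methods, and the choice to pass timestamps in explicitly are all assumptions made for illustration. The thresholds (more than 3 timed-out heartbeats within 5 minutes) are the ones stated in the description.

```java
import java.util.ArrayDeque;

// Hypothetical sketch only; NOT the WindowedEventCounter class added by the PR.
// Counts how many events were recorded within the last windowMs milliseconds.
final class WindowedEventCounterSketch {
    private final long windowMs;
    private final ArrayDeque<Long> eventTimesMs = new ArrayDeque<>();

    WindowedEventCounterSketch(long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        this.windowMs = windowMs;
    }

    // Record one event at time nowMs. Timestamps are assumed to be non-decreasing.
    void record(long nowMs) {
        expire(nowMs);
        eventTimesMs.addLast(nowMs);
    }

    // Number of events still inside the window ending at nowMs.
    int count(long nowMs) {
        expire(nowMs);
        return eventTimesMs.size();
    }

    // Drop events that are windowMs or more older than nowMs.
    private void expire(long nowMs) {
        while (!eventTimesMs.isEmpty() && eventTimesMs.peekFirst() <= nowMs - windowMs) {
            eventTimesMs.removeFirst();
        }
    }
}

// Hypothetical sketch of the overload decision; names are assumptions.
final class OverloadDetectorSketch {
    private final WindowedEventCounterSketch timedOutHeartbeats;
    private final int maxTimedOut;

    OverloadDetectorSketch(long windowMs, int maxTimedOut) {
        this.timedOutHeartbeats = new WindowedEventCounterSketch(windowMs);
        this.maxTimedOut = maxTimedOut;
    }

    // Called when the controller failed to process a heartbeat in time.
    void heartbeatTimedOut(long nowMs) {
        timedOutHeartbeats.record(nowMs);
    }

    // Overloaded when MORE than maxTimedOut heartbeats timed out within the window.
    boolean overloaded(long nowMs) {
        return timedOutHeartbeats.count(nowMs) > maxTimedOut;
    }

    // While overloaded, brokers are not fenced for missed heartbeats.
    boolean mayFenceForMissedHeartbeat(long nowMs) {
        return !overloaded(nowMs);
    }
}
```

With a 5 minute window (300000 ms) and a threshold of 3, the sketch reports overload on the fourth timed-out heartbeat inside the window and stops reporting it once enough of those events have aged out. Passing the clock in as a parameter is a design choice of this sketch that keeps it easy to test; the PR may obtain time differently.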