cmccabe opened a new pull request, #13759:
URL: https://github.com/apache/kafka/pull/13759

   When the active KRaft controller is overloaded, it will not be able to
   process broker heartbeat requests. Instead, they will time out. With the
   default configuration, this happens if the time needed to process a broker
   heartbeat climbs above a second for a sustained period. This, in turn, could
   lead to brokers being improperly fenced while they are still alive.
   
   This PR creates a new state, the overload state, which the active controller
   enters when it fails to process more than 3 heartbeats in a given 5-minute
   period. While the controller is in the overload state, no brokers will be
   fenced due to missed heartbeats (because we don't know whether the
   heartbeats were missed because of load or for another reason).
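
   The rule above can be sketched roughly as follows. This is an illustration
   only, not the code in this PR: the class name, method names, and the
   session-timeout parameter are all assumptions. Only the thresholds (more
   than 3 failures within 5 minutes) come from the description.

```java
import java.util.ArrayDeque;

// Hypothetical sketch of the overload rule; names are illustrative assumptions.
final class OverloadStateSketch {
    static final int MAX_MISSED = 3;               // "more than 3 heartbeats"
    static final long WINDOW_MS = 5 * 60 * 1000L;  // "5-minute period"

    // Times (ms) at which the controller failed to process a heartbeat.
    private final ArrayDeque<Long> missedAtMs = new ArrayDeque<>();

    // Record that a heartbeat request timed out on the controller side.
    void onHeartbeatTimedOut(long nowMs) {
        missedAtMs.addLast(nowMs);
    }

    // True while more than MAX_MISSED failures fall inside the window.
    // Assumes nowMs never goes backwards.
    boolean isOverloaded(long nowMs) {
        while (!missedAtMs.isEmpty() && nowMs - missedAtMs.peekFirst() >= WINDOW_MS) {
            missedAtMs.removeFirst();
        }
        return missedAtMs.size() > MAX_MISSED;
    }

    // A broker may be fenced for a stale session only when not overloaded.
    boolean mayFenceForMissedHeartbeat(long nowMs, long lastContactMs, long sessionTimeoutMs) {
        if (isOverloaded(nowMs)) {
            return false; // cannot tell whether the broker or the controller is at fault
        }
        return nowMs - lastContactMs > sessionTimeoutMs;
    }
}
```

   In this sketch the overload check gates the fencing decision rather than
   resetting broker sessions, so a broker whose session is stale would still
   become eligible for fencing once the window drains.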
   
   Entering the overload state increments the metadata error metric on the
   controller, which will notify the operator about the problem. We also log a
   message when the controller leaves the overload state.
   
   BrokerHeartbeatManager now tracks how many heartbeats have been missed. I
   added a builder to make it easier to add more constructor parameters.
   
   WindowedEventCounter is a new generic class which counts the number of times
   it has seen a specific event within a certain time period.
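
   As a rough idea of what such a counter does, here is a minimal sketch. It
   is not the WindowedEventCounter class from this PR; the class name, the
   constructor, and the method signatures below are assumptions made for
   illustration.

```java
import java.util.ArrayDeque;

// Hypothetical sliding-window counter; not the class added by this PR.
final class WindowedEventCounterSketch {
    private final long windowMs;
    // Event timestamps in arrival order, oldest first.
    private final ArrayDeque<Long> timestamps = new ArrayDeque<>();

    WindowedEventCounterSketch(long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        this.windowMs = windowMs;
    }

    // Record one occurrence of the event at time nowMs.
    void record(long nowMs) {
        expire(nowMs);
        timestamps.addLast(nowMs);
    }

    // Number of recorded events newer than (nowMs - windowMs).
    int count(long nowMs) {
        expire(nowMs);
        return timestamps.size();
    }

    // Drop events that are windowMs old or older. Assumes nowMs never decreases.
    private void expire(long nowMs) {
        while (!timestamps.isEmpty() && nowMs - timestamps.peekFirst() >= windowMs) {
            timestamps.removeFirst();
        }
    }
}
```

   This version stores one timestamp per event, so memory grows with the event
   rate. A real implementation might cap the number of stored timestamps at
   the threshold it cares about.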
   
   In addition to the changes above, in QuorumController, we now increment the
   metadata fault metric if we get an unexpected exception while processing an
   event. (In other words, an exception that doesn't map cleanly to a known
   return code.) This should help catch NullPointerExceptions and similar, if
   any surface again.
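
   The distinction between expected and unexpected exceptions can be sketched
   like this. Everything here is a stand-in: KnownException, the run method,
   and the counter field are assumptions for illustration and do not reflect
   the actual QuorumController code or Kafka's exception classes.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; names do not come from QuorumController.
final class EventFaultSketch {
    // Stand-in for an exception that maps cleanly to a known return code.
    static class KnownException extends RuntimeException {
        final short code;

        KnownException(short code, String message) {
            super(message);
            this.code = code;
        }
    }

    static final short NONE = 0;
    static final short UNKNOWN_SERVER_ERROR = -1;

    // Stand-in for the metadata fault metric.
    final AtomicLong metadataFaultCount = new AtomicLong();

    // Run one event and translate the outcome into a return code.
    short run(Runnable event) {
        try {
            event.run();
            return NONE;
        } catch (KnownException e) {
            // Expected failure: report its code, do not count it as a fault.
            return e.code;
        } catch (RuntimeException e) {
            // Unexpected failure (for example a NullPointerException): count it.
            metadataFaultCount.incrementAndGet();
            return UNKNOWN_SERVER_ERROR;
        }
    }
}
```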

