[
https://issues.apache.org/jira/browse/KAFKA-14292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Gustafson resolved KAFKA-14292.
-------------------------------------
Fix Version/s: 3.4.0
3.3.2
Resolution: Fixed
> KRaft broker controlled shutdown can be delayed indefinitely
> ------------------------------------------------------------
>
> Key: KAFKA-14292
> URL: https://issues.apache.org/jira/browse/KAFKA-14292
> Project: Kafka
> Issue Type: Bug
> Reporter: Jason Gustafson
> Assignee: Alyssa Huang
> Priority: Major
> Fix For: 3.4.0, 3.3.2
>
>
> We noticed when rolling a KRaft cluster that it took an unexpectedly long
> time for one of the brokers to shut down. In the logs, we saw the following:
> {code:java}
> Oct 11, 2022 @ 17:53:38.277 [Controller 1] The request from broker 8 to
> shut down can not yet be granted because the lowest active offset 2283357 is
> not greater than the broker's shutdown offset 2283358.
> org.apache.kafka.controller.BrokerHeartbeatManager DEBUG
> Oct 11, 2022 @ 17:53:38.277 [Controller 1] Updated the controlled shutdown
> offset for broker 8 to 2283362.
> org.apache.kafka.controller.BrokerHeartbeatManager DEBUG
> Oct 11, 2022 @ 17:53:40.278 [Controller 1] Updated the controlled shutdown
> offset for broker 8 to 2283366.
> org.apache.kafka.controller.BrokerHeartbeatManager DEBUG
> Oct 11, 2022 @ 17:53:40.278 [Controller 1] The request from broker 8 to
> shut down can not yet be granted because the lowest active offset 2283361 is
> not greater than the broker's shutdown offset 2283362.
> org.apache.kafka.controller.BrokerHeartbeatManager DEBUG
> Oct 11, 2022 @ 17:53:42.279 [Controller 1] The request from broker 8 to
> shut down can not yet be granted because the lowest active offset 2283365 is
> not greater than the broker's shutdown offset 2283366.
> org.apache.kafka.controller.BrokerHeartbeatManager DEBUG
> Oct 11, 2022 @ 17:53:42.279 [Controller 1] Updated the controlled shutdown
> offset for broker 8 to 2283370.
> org.apache.kafka.controller.BrokerHeartbeatManager DEBUG
> Oct 11, 2022 @ 17:53:44.280 [Controller 1] The request from broker 8 to
> shut down can not yet be granted because the lowest active offset 2283369 is
> not greater than the broker's shutdown offset 2283370.
> org.apache.kafka.controller.BrokerHeartbeatManager DEBUG
> Oct 11, 2022 @ 17:53:44.281 [Controller 1] Updated the controlled shutdown
> offset for broker 8 to 2283374.
> org.apache.kafka.controller.BrokerHeartbeatManager DEBUG {code}
> From what I can tell, it looks like the controller waits until all brokers
> have caught up to the {{controlledShutdownOffset}} of the broker that is
> shutting down before allowing it to proceed. Probably the intent is to make
> sure they have all the leader and ISR state.
> The problem is that the {{controlledShutdownOffset}} seems to be updated
> after every heartbeat that the controller receives:
> https://github.com/apache/kafka/blob/trunk/metadata/src/main/java/org/apache/kafka/controller/QuorumController.java#L1996.
> Unless all other brokers catch up to that offset before the next heartbeat
> from the shutting-down broker is received, the broker remains in the
> shutting-down state indefinitely.
> In this case, it took more than 40 minutes before the broker completed
> shutdown:
> {code:java}
> Oct 11, 2022 @ 18:36:36.105 [Controller 1] The request from broker 8 to
> shut down has been granted since the lowest active offset 2288510 is now
> greater than the broker's controlled shutdown offset 2288510.
> org.apache.kafka.controller.BrokerHeartbeatManager INFO
> Oct 11, 2022 @ 18:40:35.197 [Controller 1] The request from broker 8 to
> unfence has been granted because it has caught up with the offset of it's
> register broker record 2288906.
> org.apache.kafka.controller.BrokerHeartbeatManager INFO{code}
> It seems like the bug here is that we keep updating
> {{controlledShutdownOffset}} even after it has already been set.
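The behavior described above can be reproduced with a small toy model. This is only a sketch, not the actual QuorumController or BrokerHeartbeatManager code: the class name, the method, and all parameters (records written per heartbeat, follower lag, starting offset) are hypothetical. It also assumes the request is granted once the lowest active offset is at least the shutdown offset, which is a guess that happens to be consistent with both log excerpts.

```java
/** Toy model of the controlled shutdown check; NOT Kafka's actual code. */
final class ShutdownOffsetSketch {

    /**
     * Returns the heartbeat number at which shutdown is granted, or -1 if it
     * is still not granted after maxHeartbeats. All parameters are
     * hypothetical knobs for this model only.
     */
    static int heartbeatsUntilGranted(boolean refreshEveryHeartbeat,
                                      long recordsPerHeartbeat,
                                      long followerLag,
                                      int maxHeartbeats) {
        long logEndOffset = 1000L;            // arbitrary starting offset
        long controlledShutdownOffset = -1L;  // -1 means "not set yet"
        for (int heartbeat = 1; heartbeat <= maxHeartbeats; heartbeat++) {
            // The slowest active broker trails the metadata log by followerLag records.
            long lowestActiveOffset = logEndOffset - followerLag;
            // Assumed grant condition: the other brokers have reached the shutdown offset.
            if (controlledShutdownOffset >= 0
                    && lowestActiveOffset >= controlledShutdownOffset) {
                return heartbeat;
            }
            // Reported behavior: the offset moves forward on every heartbeat.
            // Suggested behavior: set it once and leave it alone afterwards.
            if (controlledShutdownOffset < 0 || refreshEveryHeartbeat) {
                controlledShutdownOffset = logEndOffset;
            }
            // New metadata records arrive before the next heartbeat.
            logEndOffset += recordsPerHeartbeat;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("refresh on every heartbeat: "
                + heartbeatsUntilGranted(true, 4, 5, 1000));
        System.out.println("set once: "
                + heartbeatsUntilGranted(false, 4, 5, 1000));
    }
}
```

With 4 new records per heartbeat and a follower lag of 5 records, the refreshing variant returns -1 (never granted within 1000 heartbeats), because the target stays one record ahead of the slowest broker, much like the offsets in the first log excerpt. The set-once variant is granted on the third heartbeat.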
--
This message was sent by Atlassian Jira
(v8.20.10#820010)