[jira] [Commented] (KAFKA-13944) Shutting down broker can be elected as partition leader in KRaft
[ https://issues.apache.org/jira/browse/KAFKA-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17547157#comment-17547157 ]

Jose Armando Garcia Sancio commented on KAFKA-13944:

Looks like this issue is addressed by https://issues.apache.org/jira/browse/KAFKA-13916

> Shutting down broker can be elected as partition leader in KRaft
>
>         Key: KAFKA-13944
>         URL: https://issues.apache.org/jira/browse/KAFKA-13944
>     Project: Kafka
>  Issue Type: Bug
>    Reporter: Jason Gustafson
>    Assignee: Jose Armando Garcia Sancio
>    Priority: Major
>      Labels: kip-500
>
> When a broker requests shutdown, it transitions to the CONTROLLED_SHUTDOWN
> state in the controller. It is possible for the broker to remain unfenced in
> this state until the controlled shutdown completes. When doing an election,
> the only thing we generally check is that the broker is unfenced, so this
> means we can elect a broker that is in controlled shutdown.
>
> Here are a few snippets from a recent system test in which this occurred:
> {code:java}
> // broker 2 starts controlled shutdown
> [2022-05-26 21:17:26,451] INFO [Controller 3001] Unfenced broker 2 has requested and been granted a controlled shutdown. (org.apache.kafka.controller.BrokerHeartbeatManager)
>
> // there is only one replica, so we set leader to -1
> [2022-05-26 21:17:26,452] DEBUG [Controller 3001] partition change for _foo-1 with topic ID _iUQ72T_R4mmZgI3WrsyXw: leader: 2 -> -1, leaderEpoch: 0 -> 1, partitionEpoch: 0 -> 1 (org.apache.kafka.controller.ReplicationControlManager)
>
> // controlled shutdown cannot complete immediately
> [2022-05-26 21:17:26,529] DEBUG [Controller 3001] The request from broker 2 to shut down can not yet be granted because the lowest active offset 177 is not greater than the broker's shutdown offset 244. (org.apache.kafka.controller.BrokerHeartbeatManager)
> [2022-05-26 21:17:26,530] DEBUG [Controller 3001] Updated the controlled shutdown offset for broker 2 to 244. (org.apache.kafka.controller.BrokerHeartbeatManager)
>
> // later on we elect leader 2 again
> [2022-05-26 21:17:27,703] DEBUG [Controller 3001] partition change for _foo-1 with topic ID _iUQ72T_R4mmZgI3WrsyXw: leader: -1 -> 2, leaderEpoch: 1 -> 2, partitionEpoch: 1 -> 2 (org.apache.kafka.controller.ReplicationControlManager)
>
> // now controlled shutdown is stuck because of the newly elected leader
> [2022-05-26 21:17:28,531] DEBUG [Controller 3001] Broker 2 is in controlled shutdown state, but can not shut down because more leaders still need to be moved. (org.apache.kafka.controller.BrokerHeartbeatManager)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
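The election problem described in the issue can be illustrated with a small standalone sketch. It is not Kafka's controller code and is not claimed to be the change made under KAFKA-13916; the class, enum, and method names below are hypothetical and only model the three broker states mentioned in the report. It contrasts an eligibility check that only tests for "unfenced" with one that also excludes brokers in controlled shutdown.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model for illustration only; none of these names come from Kafka.
final class ControlledShutdownElectionSketch {
    // A broker in CONTROLLED_SHUTDOWN is still unfenced, as the issue describes.
    enum BrokerState { UNFENCED, CONTROLLED_SHUTDOWN, FENCED }

    static final int NO_LEADER = -1;

    // The check described in the issue: anything that is not fenced is electable.
    static boolean isUnfenced(BrokerState state) {
        return state != BrokerState.FENCED;
    }

    // A stricter check: brokers in controlled shutdown are not electable either.
    static boolean isUnfencedAndNotShuttingDown(BrokerState state) {
        return state == BrokerState.UNFENCED;
    }

    // Returns the first replica accepted by the predicate, or NO_LEADER if none is.
    static int electLeader(List<Integer> replicas,
                           Map<Integer, BrokerState> states,
                           Predicate<BrokerState> eligible) {
        for (int replica : replicas) {
            // Brokers with no recorded state are treated as fenced (an assumption).
            BrokerState state = states.getOrDefault(replica, BrokerState.FENCED);
            if (eligible.test(state)) {
                return replica;
            }
        }
        return NO_LEADER;
    }

    public static void main(String[] args) {
        // Same shape as the system test: a single replica, broker 2, shutting down.
        List<Integer> replicas = List.of(2);
        Map<Integer, BrokerState> states = Map.of(2, BrokerState.CONTROLLED_SHUTDOWN);

        System.out.println(electLeader(replicas, states,
            ControlledShutdownElectionSketch::isUnfenced));
        System.out.println(electLeader(replicas, states,
            ControlledShutdownElectionSketch::isUnfencedAndNotShuttingDown));
    }
}
```

With the unfenced-only check, broker 2 is chosen again even though it is shutting down, which matches the "leader: -1 -> 2" transition in the logs. With the stricter check the partition stays leaderless until another replica becomes eligible.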
[jira] [Commented] (KAFKA-13944) Shutting down broker can be elected as partition leader in KRaft
[ https://issues.apache.org/jira/browse/KAFKA-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545042#comment-17545042 ]

Jose Armando Garcia Sancio commented on KAFKA-13944:

When fixing this, let's improve the logging so that the ReplicationControlManager logs the reason that triggered the election.