[ https://issues.apache.org/jira/browse/KAFKA-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673182#comment-13673182 ]
Jun Rao commented on KAFKA-927:
-------------------------------

Thanks for patch v2. A few more comments:

20. KafkaController: If the controller is no longer active when shutdownBroker is called, both state machines will throw an exception on state change calls. However, the issue is that we add the shutdown broker to controllerContext.shuttingDownBrokerIds and it's never reset. This may become a problem if this broker becomes a controller again. At the minimum, we need to reset controllerContext.shuttingDownBrokerIds in onControllerFailover(). However, I am a bit confused why the shutdown logic still works even though we never reset controllerContext.shuttingDownBrokerIds.

21. ControlledShutdownRequest.handleError(): We should probably set partitionsRemaining in ControlledShutdownResponse to empty instead of null, since the serialization of ControlledShutdownResponse doesn't handle partitionsRemaining being null.

22. testRollingBounce:
22.1 The test makes sure that the leader for topic1 is changed after broker 0 is shut down. However, the leader for topic1 could be on broker 1 initially. In this case, the leader won't be changed after broker 0 is shut down.
22.2 The default controlledShutdownRetryBackoffMs is 5 secs, which is probably too long for the unit test.

23. KafkaServer: We need to handle the errorCode in ControlledShutdownResponse since the controller may have moved after we send the ControlledShutdown request.

From the previous review:
3. I think a simple solution is to (1) not call replicaManager.replicaFetcherManager.closeAllFetchers() in KafkaServer during shutdown; (2) in KafkaController.shutdownBroker(), for each partition on the shutdown broker, we first send a stopReplicaRequest to it for that partition before going through the state machine logic. Since the state machine logic involves ZK reads/writes, it's very likely that the stopReplicaRequest will reach the broker before the subsequent LeaderAndIsr requests.
So, in most cases, the leader should be able to shrink ISR quicker than the timeout, without churns in ISR.

> Integrate controlled shutdown into kafka shutdown hook
> ------------------------------------------------------
>
>                 Key: KAFKA-927
>                 URL: https://issues.apache.org/jira/browse/KAFKA-927
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Sriram Subramanian
>            Assignee: Sriram Subramanian
>         Attachments: KAFKA-927.patch, KAFKA-927-v2.patch
>
>
> The controlled shutdown mechanism should be integrated into the software for
> better operational benefits. Also few optimizations can be done to reduce
> unnecessary rpc and zk calls. This patch has been tested on a prod like
> environment by doing rolling bounces continuously for a day. The average time
> of doing a rolling bounce with controlled shutdown for a cluster with 7 nodes
> without this patch is 340 seconds. With this patch it reduces to 220 seconds.
> Also it ensures correctness in scenarios where the controller shrinks the isr
> and the new leader could place the broker to be shutdown back into the isr.