[ 
https://issues.apache.org/jira/browse/KAFKA-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673182#comment-13673182
 ] 

Jun Rao commented on KAFKA-927:
-------------------------------

Thanks for patch v2. A few more comments:

20. KafkaController: If when shutdownBroker is called, the controller is no 
longer active, both state machines will throw an exception on state change 
calls. However, the issue is that we add the shutdown broker to 
controllerContext.shuttingDownBrokerIds and it's never reset. This may become a 
problem if this broker becomes a controller again. At the minimum, we need to 
reset controllerContext.shuttingDownBrokerIds in  onControllerFailover(). 
However, I am a bit confused why we never reset 
controllerContext.shuttingDownBrokerIds and the shutdown logic still works.

21. ControlledShutdownRequest.handleError(): We should probably set 
partitionsRemaining in ControlledShutdownResponse to empty instead of null, 
since the serialization of ControlledShutdownResponse doesn't handle 
partitionsRemaining being null.

22. testRollingBounce:
22.1 The test makes sure that the leader for topic1 is changed after broker 0 
is shutdown. However, the leader for topic1 could be on broker 1 initially. In 
this case, the leader won't be changed after broker 0 is shutdown.
22.2 The default controlledShutdownRetryBackoffMs is 5secs, which is probably 
too long for the unit test. 

23. KafkaServer: We need to handle the errorCode in ControlledShutdownResponse 
since the controller may have moved after we send the ControlledShutdown 
request.

>From the previous review:
3. I think a simple solution is to (1) not call 
replicaManager.replicaFetcherManager.closeAllFetchers() in KafkaServer during 
shutdown; (2) in KafkaController.shutdownBroker(), for each partition on the 
shutdown broker, we first send a stopReplicaRequest to it for that partition 
before going through the state machine logic. Since the state machine logic 
involves ZK reads/writes, it's very likely that the stopReplicaRequest will 
reach the broker before the subsequent LeaderAndIsr requests. So, in most 
cases, the leader should be able to shrink ISR quicker than the timeout, 
without churns in ISR.
                
> Integrate controlled shutdown into kafka shutdown hook
> ------------------------------------------------------
>
>                 Key: KAFKA-927
>                 URL: https://issues.apache.org/jira/browse/KAFKA-927
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Sriram Subramanian
>            Assignee: Sriram Subramanian
>         Attachments: KAFKA-927.patch, KAFKA-927-v2.patch
>
>
> The controlled shutdown mechanism should be integrated into the software for 
> better operational benefits. Also few optimizations can be done to reduce 
> unnecessary rpc and zk calls. This patch has been tested on a prod like 
> environment by doing rolling bounces continuously for a day. The average time 
> of doing a rolling bounce with controlled shutdown for a cluster with 7 nodes 
> without this patch is 340 seconds. With this patch it reduces to 220 seconds. 
> Also it ensures correctness in scenarios where the controller shrinks the isr 
> and the new leader could place the broker to be shutdown back into the isr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to