[ 
https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557481#comment-13557481
 ] 

Joel Koshy commented on KAFKA-705:
----------------------------------

I think this is why it happens:

https://github.com/apache/kafka/blob/03eb903ce223ab55c5acbcf4243ce805aaaf4fad/core/src/main/scala/kafka/controller/ReplicaStateMachine.scala#L150

It could occur as follows. Suppose there's a partition 'P' assigned to brokers 
x and y, with leaderAndIsr = y, {x, y}.
1. Controlled shutdown of broker x; leaderAndIsr -> y, {y}
2. After the above completes, kill -15 and then restart broker x.
3. Immediately do a controlled shutdown of broker y, so y is now in the list 
of shutting-down brokers.

Because y is in that list, x will not start its follower for 'P' from broker y.
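
To make the ordering concrete, here is a toy model of the three steps. It is 
only a sketch, not the controller code: ToyController, its methods, and the 
rule "start a follower only if the leader is not in the shutting-down list" 
are assumptions drawn from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyController:
    """Toy stand-in for the controller's view of partition 'P' (hypothetical)."""
    leader: str
    isr: set
    shutting_down: set = field(default_factory=set)
    followers_started: set = field(default_factory=set)

    def controlled_shutdown(self, broker):
        """Mark broker as shutting down; return True if it no longer leads 'P'."""
        self.shutting_down.add(broker)
        if self.leader == broker:
            candidates = self.isr - self.shutting_down
            if not candidates:
                return False  # nobody else in the ISR can take over
            self.leader = sorted(candidates)[0]
            self.followers_started.discard(self.leader)
        self.isr.discard(broker)
        return True

    def handle_broker_startup(self, broker):
        """Assumed rule: start the follower only if the leader is not shutting down."""
        self.shutting_down.discard(broker)
        if self.leader not in self.shutting_down:
            self.followers_started.add(broker)

    def catch_up(self, broker):
        """A started follower eventually rejoins the ISR."""
        if broker in self.followers_started:
            self.isr.add(broker)


def simulate(wait_before_step_3):
    c = ToyController(leader="y", isr={"x", "y"})
    c.controlled_shutdown("x")            # step 1: leaderAndIsr -> y, {y}
    # step 2: x is killed and restarted; the controller has not handled it yet
    if wait_before_step_3:
        c.handle_broker_startup("x")      # x starts following y
        c.catch_up("x")                   # and rejoins the ISR
        return c.controlled_shutdown("y"), c
    ok = c.controlled_shutdown("y")       # step 3 issued immediately
    c.handle_broker_startup("x")          # too late: leader y is shutting down
    return ok, c
```

In this model, simulate(False) leaves y as leader with x never following, 
while simulate(True) lets leadership move to x.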

Adding sufficient wait time between (2) and (3) seems to address the issue (in 
your script there's no sleep), but we should handle it properly in the shutdown 
code. I will think about a fix for that.
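
Until then, a test script could pause after each restart. The sketch below is 
one possible shape for that workaround, not the attached script: 
controlled_shutdown, restart and is_caught_up are hypothetical callables the 
caller would have to supply, and the timeout values are arbitrary.

```python
import time


def wait_until(predicate, timeout_s=60.0, poll_s=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def rolling_bounce(brokers, controlled_shutdown, restart, is_caught_up,
                   **wait_kwargs):
    """Bounce brokers one at a time, waiting after each restart.

    controlled_shutdown(b) -> bool, restart(b) and is_caught_up(b) -> bool
    are hypothetical hooks. Returns the brokers whose controlled shutdown
    failed or that did not catch up in time.
    """
    failed = []
    for b in brokers:
        if not controlled_shutdown(b):
            failed.append(b)      # skip the restart, as in the original steps
            continue
        restart(b)
        # The wait between (2) and (3): do not touch the next broker
        # until the restarted one reports that it has caught up.
        if not wait_until(lambda: is_caught_up(b), **wait_kwargs):
            failed.append(b)
            break
    return failed
```

This only works around the problem from the client side; it does not change 
how the shutdown code treats a leader that is already shutting down.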

                
> Controlled shutdown doesn't seem to work on more than one broker in a cluster
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-705
>                 URL: https://issues.apache.org/jira/browse/KAFKA-705
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Joel Koshy
>            Priority: Critical
>              Labels: bugs
>         Attachments: shutdown_brokers_eat.py, shutdown-command
>
>
> I wrote a script (attached here) that round-robins through the brokers 
> in a cluster, doing the following two operations on each of them:
> 1. Send the controlled shutdown admin command.
> 2. If it succeeds, restart the broker.
> What I've observed is that only one broker is able to finish the above 
> successfully the first time around. For the rest of the iterations, no broker 
> is able to shut down using the admin command, and every single time it fails 
> with the error message stating the same number of leaders on every broker.
