[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster
[ https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709525#comment-13709525 ]

Joel Koshy commented on KAFKA-705:

Yes we can close this.

Controlled shutdown doesn't seem to work on more than one broker in a cluster

Key: KAFKA-705
URL: https://issues.apache.org/jira/browse/KAFKA-705
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 0.8
Reporter: Neha Narkhede
Assignee: Joel Koshy
Priority: Critical
Labels: bugs
Attachments: kafka-705-incremental-v2.patch, kafka-705-v1.patch, shutdown_brokers_eat.py, shutdown-command

I wrote a script (attached here) to basically round robin through the brokers in a cluster doing the following 2 operations on each of them -
1. Send the controlled shutdown admin command. If it succeeds
2. Restart the broker

What I've observed is that only one broker is able to finish the above successfully the first time around. For the rest of the iterations, no broker is able to shutdown using the admin command and every single time it fails with the error message stating the same number of leaders on every broker.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster
[ https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13706330#comment-13706330 ]

Jay Kreps commented on KAFKA-705:

Joel, this is done, no?
[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster
[ https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559793#comment-13559793 ]

Neha Narkhede commented on KAFKA-705:

+1
[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster
[ https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559121#comment-13559121 ]

Joel Koshy commented on KAFKA-705:

I committed the fix to 0.8 with a small edit: used the liveOrShuttingDownBrokers field.

Another small issue is that we send a stop replica fetchers request to the shutting down broker even if controlled shutdown did not complete. This prematurely forces the broker out of the ISR of those partitions. I think it should be safe to avoid sending the stop replica request if controlled shutdown has not completely moved leadership of partitions off the shutting down broker.
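The guard proposed in this comment can be sketched roughly as follows. This is an illustration only, not the committed Kafka code: `should_send_stop_replica` and the `partition_leaders` mapping are hypothetical names, and the sketch assumes the controller can look up the current leader of every partition.

```python
from typing import Dict


def should_send_stop_replica(broker_id: int, partition_leaders: Dict[str, int]) -> bool:
    """Return True only if no partition is still led by the shutting down broker.

    partition_leaders is a hypothetical mapping of partition name -> leader
    broker id, standing in for whatever leadership state the controller keeps.
    """
    still_leading = [p for p, leader in partition_leaders.items() if leader == broker_id]
    # If leadership has not fully moved off the broker, controlled shutdown is
    # incomplete, so skip the stop replica request and leave the broker in the ISR.
    return not still_leading
```

With this check, a broker that still leads any partition keeps its fetchers running and is not pushed out of the ISR early.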
[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster
[ https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558341#comment-13558341 ]

Neha Narkhede commented on KAFKA-705:

+1 on the fix. And there is a problem with the script I wrote. This fix is correct, but the script will fail because it uses the shutdown command in a way that is not recommended or intended. It shuts down one broker, restarts it, and then proceeds to shut down the next broker without waiting for the restart to complete and for the first broker to re-register itself in ZooKeeper. Since the replication factor is 2, if both these brokers were the replicas for some partitions, those partitions go into the under-replicated state and the script is never able to shut any other broker down after that.

I think we should include this fix.
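A rolling bounce that waits for each broker to come back before moving on could look roughly like the sketch below. It is not the attached shutdown_brokers_eat.py. The three callables (`controlled_shutdown`, `restart`, `is_registered`) are hypothetical hooks the caller would have to supply, for example by wrapping the shutdown admin command and a check of the broker's registration in ZooKeeper.

```python
import time
from typing import Callable, Iterable, List


def rolling_bounce(
    broker_ids: Iterable[int],
    controlled_shutdown: Callable[[int], bool],  # hypothetical: True if shutdown succeeded
    restart: Callable[[int], None],              # hypothetical: starts the broker process
    is_registered: Callable[[int], bool],        # hypothetical: broker is back in ZooKeeper
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Bounce brokers one at a time and return the ids that were bounced."""
    bounced: List[int] = []
    for broker_id in broker_ids:
        if not controlled_shutdown(broker_id):
            # Leadership could not be moved off this broker, so stop the run
            # instead of taking a second replica offline.
            break
        restart(broker_id)
        waited = 0.0
        # Do not touch the next broker until this one has re-registered.
        while not is_registered(broker_id):
            if waited >= timeout_s:
                raise TimeoutError(f"broker {broker_id} did not re-register")
            sleep(poll_s)
            waited += poll_s
        bounced.append(broker_id)
    return bounced
```

Re-registration alone may not mean the broker has caught up and rejoined the ISR, so a stricter script would probably also wait until no partitions are under-replicated before continuing.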
[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster
[ https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557481#comment-13557481 ]

Joel Koshy commented on KAFKA-705:

I think this is why it happens: https://github.com/apache/kafka/blob/03eb903ce223ab55c5acbcf4243ce805aaaf4fad/core/src/main/scala/kafka/controller/ReplicaStateMachine.scala#L150

It could occur as follows. Suppose there's a partition 'P' assigned to brokers x and y; leaderAndIsr = y, {x, y}
1. Controlled shutdown of broker x; leaderAndIsr -> y, {y}
2. After the above completes, kill -15 and then restart broker x
3. Immediately do a controlled shutdown of broker y; so now y is in the list of shutting down brokers.

Due to the above, x will not start its follower for 'P' on broker y. Adding sufficient wait time between (2) and (3) seems to address the issue (in your script there's no sleep), but we should handle it properly in the shutdown code. Will think about a fix for that.
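The three-step scenario in this comment can be replayed with a toy model. The model is an assumption made for illustration, not the logic in ReplicaStateMachine.scala: it supposes that a restarted replica is only told to follow a leader the controller counts as available, and that brokers in the shutting down list are excluded from that set unless explicitly included. `ControllerView` and `follower_starts` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ControllerView:
    """Hypothetical, simplified view of broker liveness held by the controller."""
    live: Set[str]
    shutting_down: Set[str] = field(default_factory=set)

    def leader_available(self, leader: str, include_shutting_down: bool) -> bool:
        candidates = (self.live | self.shutting_down) if include_shutting_down else self.live
        return leader in candidates


def follower_starts(include_shutting_down: bool) -> bool:
    """Replay the scenario for partition 'P' on brokers x and y (leader y)."""
    view = ControllerView(live={"x", "y"})
    leader, isr = "y", {"x", "y"}

    # 1. Controlled shutdown of broker x: x leaves the ISR.
    isr = {"y"}
    view.live.discard("x")

    # 2. Broker x is killed and restarted.
    view.live.add("x")

    # 3. Controlled shutdown of y starts right away: y is now "shutting down".
    view.live.discard("y")
    view.shutting_down.add("y")

    # x becomes a follower of 'P' only if its leader y is considered available.
    return view.leader_available(leader, include_shutting_down)
```

In this model `follower_starts(False)` is False, which mirrors the stuck state described above, while counting shutting down brokers as candidates makes it True.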
[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster
[ https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555352#comment-13555352 ]

Joel Koshy commented on KAFKA-705:

I set up a local cluster of three brokers and created a bunch of topics, replication factor = 2. I was able to do multiple iterations of rolling bounces without issue. Since this was local, I did not use your Python script as it kills PIDs returned by ps.

Would you by any chance be able to provide a scenario to reproduce this locally? That said, I believe John Fung also tried to reproduce this in a distributed environment but was unable to do so; so I'll probably need to take a look at logs in your environment.