[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster

2013-07-16 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709525#comment-13709525
 ] 

Joel Koshy commented on KAFKA-705:
--

Yes we can close this.

 Controlled shutdown doesn't seem to work on more than one broker in a cluster
 -

                 Key: KAFKA-705
                 URL: https://issues.apache.org/jira/browse/KAFKA-705
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 0.8
            Reporter: Neha Narkhede
            Assignee: Joel Koshy
            Priority: Critical
              Labels: bugs
         Attachments: kafka-705-incremental-v2.patch, kafka-705-v1.patch, 
                      shutdown_brokers_eat.py, shutdown-command


 I wrote a script (attached here) that round-robins through the brokers in a 
 cluster, performing the following two operations on each of them:
 1. Send the controlled shutdown admin command.
 2. If the command succeeds, restart the broker.
 What I've observed is that only one broker is able to finish the above 
 successfully the first time around. For the rest of the iterations, no broker 
 is able to shut down using the admin command, and every single time it fails 
 with an error message reporting the same number of leaders on every broker. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster

2013-07-11 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13706330#comment-13706330
 ] 

Jay Kreps commented on KAFKA-705:
-

Joel, this is done, no?


[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster

2013-01-22 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559793#comment-13559793
 ] 

Neha Narkhede commented on KAFKA-705:
-

+1


[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster

2013-01-21 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559121#comment-13559121
 ] 

Joel Koshy commented on KAFKA-705:
--

I committed the fix to 0.8 with a small edit: it uses the 
liveOrShuttingDownBrokers field.

Another small issue is that we send a stop replica request to the 
shutting-down broker, stopping its replica fetchers, even if controlled 
shutdown did not complete. This prematurely forces the broker out of the ISR 
of those partitions. I think it should be safe to avoid sending the stop 
replica request if controlled shutdown has not completely moved leadership of 
partitions off the shutting-down broker.
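
The guard proposed above can be sketched in a few lines. This is only an 
illustration of the condition, not Kafka's controller code: the function name 
and the partition-to-leader dictionary are hypothetical stand-ins for whatever 
state the controller actually keeps.

```python
def should_send_stop_replica(broker_id, partition_leaders):
    """Return True only if `broker_id` no longer leads any partition.

    `partition_leaders` is a hypothetical mapping of partition name to the
    id of its current leader. The idea from the comment: if controlled
    shutdown left any leadership on the broker, skip the stop replica
    request so the broker is not pushed out of the ISR prematurely.
    """
    return all(leader != broker_id for leader in partition_leaders.values())


# Broker 1 still leads P0, so the request would be withheld for it.
leaders = {"P0": 1, "P1": 2}
print(should_send_stop_replica(1, leaders))
print(should_send_stop_replica(3, leaders))
```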


[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster

2013-01-20 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558341#comment-13558341
 ] 

Neha Narkhede commented on KAFKA-705:
-

+1 on the fix. There is also a problem with the script I wrote. The fix is 
correct, but the script will still fail because it uses the shutdown command 
in a way that is neither recommended nor intended. It shuts down one broker 
and restarts it, then proceeds to shut down the next broker without waiting 
for the restart to complete and for the first broker to re-register itself in 
ZooKeeper. Since the replication factor is 2, if both of these brokers were 
the replicas for some partitions, those partitions become under-replicated 
and the script is never able to shut down any other broker after that.

I think we should include this fix.
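
A rolling bounce that avoids the problem described above might look roughly 
like the sketch below. It is not the attached script. The `shutdown`, 
`restart`, and `is_registered` callables are hypothetical hooks the caller 
must supply (for example, `is_registered` could check for the broker's 
ephemeral node under /brokers/ids in ZooKeeper), and the timeout values are 
arbitrary.

```python
import time


def rolling_bounce(broker_ids, shutdown, restart, is_registered,
                   timeout_s=60.0, poll_s=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Bounce brokers one at a time and return the ids that were bounced.

    shutdown(broker_id)      -> True if controlled shutdown succeeded
    restart(broker_id)       -> starts the broker process again
    is_registered(broker_id) -> True once the broker has re-registered
    All three are caller-supplied, hypothetical hooks.
    """
    bounced = []
    for broker_id in broker_ids:
        if not shutdown(broker_id):
            break  # controlled shutdown failed; do not touch further brokers
        restart(broker_id)
        # The step the original script skipped: wait for the restarted
        # broker to re-register before moving on to the next one.
        deadline = clock() + timeout_s
        while not is_registered(broker_id):
            if clock() >= deadline:
                raise TimeoutError(f"broker {broker_id} did not re-register")
            sleep(poll_s)
        bounced.append(broker_id)
    return bounced
```

Re-registration is the condition named in the comment. Waiting until the 
restarted broker is back in the ISR of its partitions would be a stricter 
check, at the cost of a slower bounce.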


[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster

2013-01-18 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557481#comment-13557481
 ] 

Joel Koshy commented on KAFKA-705:
--

I think this is why it happens:

https://github.com/apache/kafka/blob/03eb903ce223ab55c5acbcf4243ce805aaaf4fad/core/src/main/scala/kafka/controller/ReplicaStateMachine.scala#L150

It could occur as follows. Suppose there's a partition 'P' assigned to brokers 
x and y; leaderAndIsr = y, {x, y}
1. Controlled shutdown of broker x; leaderAndIsr = y, {y}
2. After the above completes, kill -15 and then restart broker x
3. Immediately do a controlled shutdown of broker y, so y is now in the list 
of shutting-down brokers.

Because y is in that list, x will not start its follower for 'P' from broker y.

Adding sufficient wait time between (2) and (3) seems to address the issue 
(your script has no sleep), but we should handle it properly in the shutdown 
code. I will think about a fix for that.
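
The sequence above can be replayed with a small toy model. This is a sketch 
under stated assumptions, not Kafka's controller code: `ToyController`, its 
fields, and `replay` are hypothetical. It assumes that a broker stays in the 
shutting-down set after a failed controlled shutdown, and that a restarted 
replica is not made a follower of a leader in that set. The 
`follow_shutting_down_leader` flag loosely mirrors the idea suggested by the 
`liveOrShuttingDownBrokers` field name mentioned elsewhere in this thread; the 
actual patch is not shown here.

```python
class ToyController:
    """Toy model of one partition 'P' with replicas on brokers x and y."""

    def __init__(self, follow_shutting_down_leader=False):
        self.leader = "y"
        self.isr = {"x", "y"}
        self.shutting_down = set()
        self.follow_shutting_down_leader = follow_shutting_down_leader

    def controlled_shutdown(self, broker):
        """Try to move leadership off `broker`; True if shutdown can finish."""
        self.shutting_down.add(broker)  # assumed to persist on failure
        if self.leader == broker:
            candidates = sorted(self.isr - self.shutting_down)
            if not candidates:
                return False  # no in-sync replica can take over leadership
            self.leader = candidates[0]
        self.isr.discard(broker)
        return True

    def on_broker_startup(self, broker):
        """Controller handles a restarted broker; True if it starts following."""
        self.shutting_down.discard(broker)
        leader_leaving = self.leader in self.shutting_down
        if leader_leaving and not self.follow_shutting_down_leader:
            return False  # assumed buggy path: no follower is started
        self.isr.add(broker)  # simplification: catch-up is instantaneous
        return True


def replay(wait_between_steps, follow_shutting_down_leader=False):
    """Replay steps 1-3; return whether broker y can be shut down."""
    c = ToyController(follow_shutting_down_leader)
    assert c.controlled_shutdown("x")       # step 1: leaderAndIsr = y, {y}
    if wait_between_steps:
        c.on_broker_startup("x")            # step 2 is fully processed first
        return c.controlled_shutdown("y")   # step 3
    first_try = c.controlled_shutdown("y")  # step 3 races ahead of step 2
    c.on_broker_startup("x")
    return first_try or c.controlled_shutdown("y")  # one retry


print(replay(wait_between_steps=False))
print(replay(wait_between_steps=True))
print(replay(wait_between_steps=False, follow_shutting_down_leader=True))
```

In this model the racing case never recovers, because x stays out of the ISR 
and y has no replica to hand leadership to. Either waiting between (2) and (3) 
or letting x follow a shutting-down leader breaks the deadlock.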


[jira] [Commented] (KAFKA-705) Controlled shutdown doesn't seem to work on more than one broker in a cluster

2013-01-16 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555352#comment-13555352
 ] 

Joel Koshy commented on KAFKA-705:
--

I set up a local cluster of three brokers and created a bunch of topics with 
replication factor = 2. I was able to do multiple iterations of rolling 
bounces without issue. Since this was local, I did not use your Python script, 
as it kills the PIDs returned by ps.

Would you by any chance be able to provide a scenario to reproduce this 
locally? That said, I believe John Fung also tried to reproduce this in a 
distributed environment but was unable to do so, so I'll probably need to take 
a look at the logs in your environment.

