[jira] [Commented] (KAFKA-6442) Catch 22 with cluster rebalancing
[ https://issues.apache.org/jira/browse/KAFKA-6442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323763#comment-16323763 ] Andreas commented on KAFKA-6442:

For what it's worth, restarting node4 (for maintenance) seems to have got everything unstuck. I guess it triggered a leader re-election internally?

> Catch 22 with cluster rebalancing
> ---------------------------------
>
> Key: KAFKA-6442
> URL: https://issues.apache.org/jira/browse/KAFKA-6442
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.2.1
> Reporter: Andreas
>
> PS. I classified this as a bug because I think the cluster should not be stuck in that situation; apologies if that is wrong.
>
> Hi,
> I found myself in a situation that is a bit difficult to explain, so I will skip how I ended up here; here is the problem.
> Some of the brokers of my cluster are permanently gone. Consequently, some partitions now had offline leaders, so I used {{kafka-reassign-partitions.sh}} to rebalance my topics, and for the most part that worked OK. Where it did not work was for partitions whose leader, replicas, and ISR were entirely on the gone brokers. Those got stuck halfway through and now look like this:
>
> Topic: topicA  Partition: 32  Leader: -1  Replicas: 1,6,2,7,3,8  Isr:
>
> (1, 2, 3 are legit; 6, 7, 8 are permanently gone)
>
> So the first catch-22 is that I cannot elect a new leader, because the leader needs to be elected from the ISR, and I cannot recreate the ISR because the partition has no leader.
> The second catch-22 is that I cannot rerun {{kafka-reassign-partitions.sh}}, because the previous run is supposedly still in progress, and I cannot increase the number of partitions to account for the now permanently offline ones, because that produces the following error: {{Error while executing topic command requirement failed: All partitions should have the same number of replicas.}} I cannot recover from that either, because I cannot run {{kafka-reassign-partitions.sh}}.
> Is there a way to recover from such a situation?

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
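For reference, a recovery path that has worked on 0.8.x-era clusters for exactly this kind of stuck reassignment is to delete the pending-reassignment znode by hand and then submit a fresh reassignment that targets only live brokers. This is a sketch, not something from this thread: the ZooKeeper address is taken from the later comments, and the znode path and resubmission step are standard operational lore for this Kafka vintage, not verified against the reporter's cluster. The topic, partition, and broker IDs mirror the report.

```shell
# 1. Clear the stuck reassignment marker so a new reassignment can be
#    submitted (the controller considers the old one "in progress" as
#    long as /admin/reassign_partitions exists).
./zookeeper-shell.sh localhost:2181 <<< "delete /admin/reassign_partitions"

# 2. Describe a new assignment that uses only the live brokers (1, 2, 3).
cat > /tmp/reassign.json <<'EOF'
{"version": 1,
 "partitions": [
   {"topic": "topicA", "partition": 32, "replicas": [1, 2, 3]}
 ]}
EOF

# 3. Submit the new assignment (0.8.x tooling talks to ZooKeeper directly).
./kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file /tmp/reassign.json --execute
```

The controller may still hold the old reassignment in its in-memory cache after step 1, in which case forcing a controller re-election (as suggested later in this thread) makes it pick up the cleared state.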
[jira] [Commented] (KAFKA-6442) Catch 22 with cluster rebalancing
[ https://issues.apache.org/jira/browse/KAFKA-6442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323747#comment-16323747 ] Andreas commented on KAFKA-6442:

Thanks for the reply. I am afraid "unclean.leader.election.enable" is not set at all, so it should default to true. Running

./zookeeper-shell.sh localhost:2181 <<< "ls /brokers/ids"

returns

WatchedEvent state:SyncConnected type:None path:null
[1, 2, 3, 4]

which is legit.
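A related sanity check, in the same style as the broker-liveness command above (a sketch; the localhost:2181 address is the one used elsewhere in this thread): the stuck reassignment itself is recorded in ZooKeeper and can be inspected directly.

```shell
# Inspect the pending reassignment that blocks new ones. If the JSON it
# holds still references the dead brokers (6, 7, 8), the reassignment can
# never complete on its own.
./zookeeper-shell.sh localhost:2181 <<< "get /admin/reassign_partitions"
```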
[jira] [Commented] (KAFKA-6442) Catch 22 with cluster rebalancing
[ https://issues.apache.org/jira/browse/KAFKA-6442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323021#comment-16323021 ] Jan Filipiak commented on KAFKA-6442:

I guess discussing this on the mailing list / Slack first could get you help quicker. Did you set "unclean.leader.election.enable" to false? An unclean election should allow broker 1 to take leadership with whatever data it has. If unclean election is enabled and the broker is still not stepping up, I usually delete the /controller node in ZooKeeper so that a new controller is elected to take care of these partitions again.
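The controller re-election described above can be sketched as follows (assuming ZooKeeper at localhost:2181, as in the other comments in this thread):

```shell
# /controller is an ephemeral znode owned by the current controller.
# Deleting it forces the brokers to race to elect a new controller, which
# then re-evaluates partition state, including leaderless partitions.
./zookeeper-shell.sh localhost:2181 <<< "delete /controller"

# Afterwards, verify that a new controller has registered itself.
./zookeeper-shell.sh localhost:2181 <<< "get /controller"
```

This is disruptive but safe in the sense that controller failover is a normal cluster event; it is the standard lever for making a controller drop stale in-memory state on these older Kafka versions.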