[jira] [Commented] (KAFKA-6013) Controller getting stuck
[ https://issues.apache.org/jira/browse/KAFKA-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199121#comment-16199121 ]

Ivan Babrou commented on KAFKA-6013:
------------------------------------

I forced preferred leader election and the current controller switched 9 -> 7 -> 0.

> Controller getting stuck
> ------------------------
>
>                 Key: KAFKA-6013
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6013
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.11.0.0, 0.11.0.1
>            Reporter: Ivan Babrou
>              Labels: reliability
>
> It looks like a new issue in 0.11.0.0, and 0.11.0.1 still has it.
> We upgraded one of the clusters from 0.11.0.0 to 0.11.0.1 by shutting down 28
> machines at once (single rack). When the nodes came up, none of them progressed
> past these log lines:
> {noformat}
> Oct 05 02:17:42 mybroker14 kafka[32940]: INFO Kafka version : 0.11.0.1 (org.apache.kafka.common.utils.AppInfoParser)
> Oct 05 02:17:42 mybroker14 kafka[32940]: INFO Kafka commitId : c2a0d5f9b1f45bf5 (org.apache.kafka.common.utils.AppInfoParser)
> Oct 05 02:17:42 mybroker14 kafka[32940]: INFO [Kafka Server 10014], started (kafka.server.KafkaServer)
> {noformat}
> There was no indication in the controller node's logs that it picked up the
> rebooted nodes. This happened multiple times during the upgrade: once per rack,
> plus some on top of that.
> The reboot took ~20m; all nodes in a single rack rebooted in parallel.
> The fix was to restart the controller node, but that did not go cleanly either:
> {noformat}
> ivan@mybroker26:~$ sudo journalctl --since 01:00 -u kafka | fgrep 'Error during controlled shutdown' -A1
> Oct 05 01:57:41 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error during controlled shutdown, possibly because leader movement took longer than the configured controller.socket.timeout.ms and/or request.timeout.ms: Connection to 10026 was disconnected before the response was read (kafka.server.KafkaServer)
> Oct 05 01:57:46 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Retrying controlled shutdown after the previous attempt failed... (kafka.server.KafkaServer)
> --
> Oct 05 01:58:16 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error during controlled shutdown, possibly because leader movement took longer than the configured controller.socket.timeout.ms and/or request.timeout.ms: Connection to 10026 was disconnected before the response was read (kafka.server.KafkaServer)
> Oct 05 01:58:18 mybroker26 kafka[37409]: INFO Rolled new log segment for 'requests-40' in 3 ms. (kafka.log.Log)
> --
> Oct 05 01:58:51 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error during controlled shutdown, possibly because leader movement took longer than the configured controller.socket.timeout.ms and/or request.timeout.ms: Connection to 10026 was disconnected before the response was read (kafka.server.KafkaServer)
> Oct 05 01:58:56 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Retrying controlled shutdown after the previous attempt failed... (kafka.server.KafkaServer)
> {noformat}
> I'm unable to reproduce the issue by just restarting or even rebooting one
> broker; the controller picks it up:
> {noformat}
> Oct 05 03:18:18 mybroker83 kafka[37402]: INFO [Controller 10083]: Newly added brokers: 10001, deleted brokers: , all live brokers: ...
> {noformat}
> KAFKA-5028 happened in 0.11.0.0, so it's likely related.
> cc [~ijuma]
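For reference, a quick way to see which broker currently holds the controller role (and to watch switches like the 9 -> 7 -> 0 above) is to poll the ActiveControllerCount gauge over JMX; it reads 1 only on the broker acting as controller. A minimal sketch, assuming hypothetical broker hostnames and JMX exposed on port 9999:

{code:java}
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FindActiveController {
    public static void main(String[] args) throws Exception {
        // Hypothetical JMX endpoints; substitute the cluster's real host:port pairs.
        String[] brokers = {"mybroker9:9999", "mybroker7:9999", "mybroker0:9999"};
        ObjectName gauge = new ObjectName(
                "kafka.controller:type=KafkaController,name=ActiveControllerCount");
        for (String hostPort : brokers) {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + hostPort + "/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                // The gauge is 1 on the broker holding the controller role, 0 elsewhere.
                Object value = jmxc.getMBeanServerConnection().getAttribute(gauge, "Value");
                System.out.println(hostPort + " ActiveControllerCount=" + value);
            }
        }
    }
}
{code}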
[jira] [Commented] (KAFKA-6013) Controller getting stuck
[ https://issues.apache.org/jira/browse/KAFKA-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199031#comment-16199031 ]

Ivan Babrou commented on KAFKA-6013:
------------------------------------

I restarted one node again today and the controller forgot to pick it up after the restart. I had to restart controller nodes twice, because each new controller also managed to forget about the previous controller node I restarted. On the third node everything finally came to order. All these machines were up for ~5d.

One observation that I don't understand: kafka.controller:type=KafkaController,name=ControllerState reports state 9 (ISR change) all the time. Shouldn't it be 0 (idle) most of the time?

Stuck controllers have this stack for the controller thread:

{noformat}
Oct 10 17:05:17 mybroker70 kafka[37433]: "controller-event-thread" #77 prio=5 os_prio=0 tid=0x7f5cda487800 nid=0x963f in Object.wait() [0x7f5aaeced000]
Oct 10 17:05:17 mybroker70 kafka[37433]:    java.lang.Thread.State: WAITING (on object monitor)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at java.lang.Object.wait(Native Method)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at java.lang.Object.wait(Object.java:502)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1406)
Oct 10 17:05:17 mybroker70 kafka[37433]:         - locked <0x0007b2e00540> (a org.apache.zookeeper.ClientCnxn$Packet)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1210)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1241)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at org.I0Itec.zkclient.ZkConnection.readData(ZkConnection.java:125)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at org.I0Itec.zkclient.ZkClient$12.call(ZkClient.java:1104)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at org.I0Itec.zkclient.ZkClient$12.call(ZkClient.java:1100)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:991)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:1100)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:1095)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at kafka.utils.ZkUtils.readDataMaybeNull(ZkUtils.scala:660)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at kafka.controller.KafkaController$IsrChangeNotification.getTopicAndPartition(KafkaController.scala:1329)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at kafka.controller.KafkaController$IsrChangeNotification.$anonfun$process$26(KafkaController.scala:1310)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at kafka.controller.KafkaController$IsrChangeNotification$$Lambda$1253/1422719045.apply(Unknown Source)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:241)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.collection.TraversableLike$$Lambda$391/1306246648.apply(Unknown Source)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.collection.Iterator.foreach(Iterator.scala:929)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.collection.Iterator.foreach$(Iterator.scala:929)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.collection.IterableLike.foreach(IterableLike.scala:71)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.collection.IterableLike.foreach$(IterableLike.scala:70)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.collection.TraversableLike.flatMap(TraversableLike.scala:241)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:238)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at kafka.controller.KafkaController$IsrChangeNotification.process(KafkaController.scala:1310)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:50)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at kafka.controller.ControllerEventManager$ControllerEventThread$$Lambda$395/1856206530.apply$mcV$sp(Unknown Source)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
Oct 10 17:05:17 mybroker70 kafka[37433]:         at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
Oct 10 17:05:17
{noformat}
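The stack shows the controller-event-thread synchronously reading ISR-change notification znodes one at a time, so the backlog under /isr_change_notification is worth checking: a large or growing child count would keep ControllerState pinned at 9 (IsrChange). A minimal sketch with the plain ZooKeeper client, assuming a hypothetical connect string:

{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class IsrChangeBacklog {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper connect string; substitute the cluster's own.
        ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> { });
        try {
            // Brokers queue ISR-change notifications as sequential znodes under
            // this path; the controller reads and then deletes them. A large,
            // growing child count means the controller is not keeping up.
            List<String> pending = zk.getChildren("/isr_change_notification", false);
            System.out.println("pending ISR change notifications: " + pending.size());
        } finally {
            zk.close();
        }
    }
}
{code}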
[jira] [Commented] (KAFKA-6013) Controller getting stuck
[ https://issues.apache.org/jira/browse/KAFKA-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16198489#comment-16198489 ]

Ismael Juma commented on KAFKA-6013:
------------------------------------

cc [~onurkaraman] [~junrao]

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)