Jason Rosenberg created KAFKA-2125:
--------------------------------------

             Summary: Infinite loop after controlled shutdown succeeds
                 Key: KAFKA-2125
                 URL: https://issues.apache.org/jira/browse/KAFKA-2125
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.8.2.1
            Reporter: Jason Rosenberg
            Priority: Blocker
I have a 4-node cluster, running 0.8.2.1, that got into a bad state last night during a rolling restart. The first node to be restarted was the controller. Controlled shutdown completed successfully, after about 800ms, but after that the server failed to shut down and got into what appears to be an infinite cycle involving the 'Controller-45-to-broker-45-send-thread' and the 'kafka-scheduler-0' thread. Ultimately, the shutdown timed out (after 3 minutes), the server was terminated by the deployment system, and it was restarted. As expected, when it came back up it took some time re-syncing its partitions, but eventually all was well.

However, I think there was something fundamentally wrong with the shutdown process. For one thing, the controller didn't seem to give up controller status as part of the controlled shutdown (which I would think should be the first thing it does?). Anyway, below are some log snippets. I do have full logs from each broker in the cluster, which I can provide (but I would have to significantly anonymize them before forwarding along).

Controlled shutdown starts and succeeds:
{code}
2015-04-14 05:56:10,134 INFO [Thread-38] server.KafkaServer - [Kafka Server 45], shutting down
2015-04-14 05:56:10,136 INFO [Thread-38] server.KafkaServer - [Kafka Server 45], Starting controlled shutdown
2015-04-14 05:56:10,150 INFO [kafka-request-handler-0] controller.KafkaController - [Controller 45]: Shutting down broker 45
...
...
2015-04-14 05:56:10,951 INFO [Thread-38] server.KafkaServer - [Kafka Server 45], Controlled shutdown succeeded
{code}

Subsequently, the 'Controller-45-to-broker-45-send-thread' starts repeatedly spamming the logs, like so:
{code}
2015-04-14 05:56:11,281 WARN [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 epoch 21 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1) -> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48) to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
java.nio.channels.ClosedChannelException
        at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
        at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
        at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
2015-04-14 05:56:11,281 INFO [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 connected to id:45,host:broker45.square,port:12345 for sending state change requests
2015-04-14 05:56:11,582 WARN [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 epoch 21 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1) -> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48) to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
java.nio.channels.ClosedChannelException
        at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
        at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
        at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
2015-04-14 05:56:11,582 INFO [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 connected to id:45,host:broker45.square,port:12345 for sending state change requests
{code}

This seems to repeat every 300 ms or so, and it continued until the server was forcibly shut down (about 3 minutes, in this case). The partition being logged (anonymized as 'mytopic,1') is always the same topic/partition in these repeated messages, so perhaps that one partition got into a bad state.

Interspersed, there are floods of log messages like these, repeated seemingly for every partition on the broker (the broker has about 450 partitions). It's trying to shrink the ISR of each partition down to just the current broker (even though it just completed controlled shutdown for all partitions). The following sequence occurs in bursts of about 500ms, once for each partition on the box, and then recurs every 10 seconds:
{code}
2015-04-14 05:56:23,716 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic2,0] on broker 45: Shrinking ISR for partition [mytopic2,0] from 46,45 to 45
2015-04-14 05:56:23,720 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic2,0] on broker 45: Cached zkVersion [19] not equal to that in zookeeper, skip updating ISR
2015-04-14 05:56:23,720 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic3,13] on broker 45: Shrinking ISR for partition [mytopic3,13] from 45,48 to 45
2015-04-14 05:56:23,723 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic3,13] on broker 45: Cached zkVersion [8] not equal to that in zookeeper, skip updating ISR
2015-04-14 05:56:23,723 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic4,12] on broker 45: Shrinking ISR for partition [mytopic4,12] from 46,45 to 45
2015-04-14 05:56:23,726 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic4,12] on broker 45: Cached zkVersion [19] not equal to that in zookeeper, skip updating ISR
{code}

So there appear to be 2 different repeating, non-terminating loops:
1. the Controller-45-to-broker-45-send-thread, repeating every 300ms
2. the kafka-scheduler thread, attempting to shrink the ISR for every partition on the box, every 10 seconds

It seems clear that this sequence would recur indefinitely had the app not been killed hard.
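For what it's worth, the stack trace above (BlockingChannel.send called from RequestSendThread.doWork, driven by ShutdownableThread.run) suggests a retry pattern like the sketch below. This is a hypothetical, simplified illustration only, not the actual 0.8.2.1 source; the class and trait names are invented, and the 300ms pacing is just an assumption based on the log timestamps. It is meant to show why the send thread never terminates on its own: each failed send is answered by a reconnect and an immediate retry of the same request (note the identical CorrelationId 305175 in both WARN lines), and nothing in the loop reacts to the controlled shutdown having completed.

{code}
// Hypothetical, simplified sketch (NOT the actual Kafka 0.8.2.1 source) of the retry
// pattern implied by the stack trace above. All names below are invented for
// illustration; the 300ms pacing is an assumption based on the log timestamps.
trait BlockingChannelLike {
  def send(request: String): Unit
  def reconnect(): Unit
}

class SimplifiedSendThread(channel: BlockingChannelLike) extends Thread {
  @volatile private var running = true

  override def run(): Unit = {
    // Nothing in this loop observes that controlled shutdown has already completed,
    // so it keeps retrying until the whole process is killed.
    while (running) {
      doWork()
      Thread.sleep(300)
    }
  }

  private def doWork(): Unit = {
    try {
      channel.send("LeaderAndIsrRequest")   // throws ClosedChannelException, as seen above
    } catch {
      case _: java.nio.channels.ClosedChannelException =>
        // corresponds to the "fails to send request ... Reconnecting to broker." WARN,
        // followed by the "connected to id:45 ... for sending state change requests" INFO;
        // the same request is retried on the next iteration, so the cycle never ends
        channel.reconnect()
    }
  }

  def shutdown(): Unit = running = false    // apparently never invoked in this bad state
}
{code}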