Jason Rosenberg created KAFKA-2125:
--------------------------------------
Summary: Infinite loop after controlled shutdown succeeds
Key: KAFKA-2125
URL: https://issues.apache.org/jira/browse/KAFKA-2125
Project: Kafka
Issue Type: Bug
Affects Versions: 0.8.2.1
Reporter: Jason Rosenberg
Priority: Blocker
I have a 4 node cluster, running 0.8.2.1, that got into a bad state last night
during a rolling restart. The first node to be restarted was the controller.
Controlled Shutdown completed successfully, after about 800ms. But after that,
the server failed to shut down, and got into what appears to be an infinite
cycle involving the 'Controller-45-to-broker-45-send-thread' and the
'kafka-scheduler-0' thread. Ultimately, the shutdown timed out (after 3
minutes), the process was terminated by the deployment system, and the server
was restarted. As expected, when it came back up it took some time re-syncing
its partitions, but eventually it came back and all was well. However, I think
there was something fundamentally wrong with the shutdown process. For one
thing, the controller didn't seem to give up controller status as part of the
controlled shutdown (which I should think would be the first thing it should
do?).
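For reference, one way to verify whether the controller actually moved during
the restart is to read the /controller znode directly, since Kafka stores the
active controller's broker id there as JSON. A minimal sketch using the plain
ZooKeeper client (the connect string below is made up; a real check would also
wait for the session to establish before reading):
{code}
import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}

object ControllerCheck {
  def main(args: Array[String]): Unit = {
    // Hypothetical ZooKeeper connect string -- substitute the cluster's own.
    val zk = new ZooKeeper("zk1.square:2181", 30000, new Watcher {
      override def process(event: WatchedEvent): Unit = ()
    })
    // The active controller's broker id lives in /controller as JSON,
    // e.g. {"version":1,"brokerid":45,"timestamp":"..."}
    val data = zk.getData("/controller", false, null)
    println(new String(data, "UTF-8"))
    zk.close()
  }
}
{code}
If the controller had resigned as part of the controlled shutdown, I'd expect
this to show a different broker id well before the process was killed.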
Anyway, below are some log snippets. I do have full logs from each broker in
the cluster, which I can provide (but would have to significantly anonymize
them before forwarding them along).
Controlled shutdown starts and succeeds:
{code}
2015-04-14 05:56:10,134 INFO [Thread-38] server.KafkaServer - [Kafka Server
45], shutting down
2015-04-14 05:56:10,136 INFO [Thread-38] server.KafkaServer - [Kafka Server
45], Starting controlled shutdown
2015-04-14 05:56:10,150 INFO [kafka-request-handler-0]
controller.KafkaController - [Controller 45]: Shutting down broker 45
...
...
2015-04-14 05:56:10,951 INFO [Thread-38] server.KafkaServer - [Kafka Server
45], Controlled shutdown succeeded
{code}
Subsequently, the 'Controller-45-to-broker-45-send-thread' starts repeatedly
spamming the logs, like so:
{code}
2015-04-14 05:56:11,281 WARN [Controller-45-to-broker-45-send-thread]
controller.RequestSendThread - [Controller-45-to-broker-45-send-thread],
Controller 45 epoch 21 fails to send request
Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1)
->
(LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48)
to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
at
kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
at
kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
2015-04-14 05:56:11,281 INFO [Controller-45-to-broker-45-send-thread]
controller.RequestSendThread - [Controller-45-to-broker-45-send-thread],
Controller 45 connected to id:45,host:broker45.square,port:12345 for sending
state change requests
2015-04-14 05:56:11,582 WARN [Controller-45-to-broker-45-send-thread]
controller.RequestSendThread - [Controller-45-to-broker-45-send-thread],
Controller 45 epoch 21 fails to send request
Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1)
->
(LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48)
to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
at
kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
at
kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
2015-04-14 05:56:11,582 INFO [Controller-45-to-broker-45-send-thread]
controller.RequestSendThread - [Controller-45-to-broker-45-send-thread],
Controller 45 connected to id:45,host:broker45.square,port:12345 for sending
state change requests
{code}
This seems to repeat every 300 ms or so, and continues until the server is
forcibly shut down (about 3 minutes later in this case). The partition being
logged (anonymized as 'mytopic,1') is always the same topic/partition in these
repeated messages, so perhaps that one partition got into a bad state.
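The retry cadence is consistent with a send loop of roughly the following
shape (a paraphrase of the behavior in the log, not the actual
kafka.controller.RequestSendThread code): on a failed send, the thread logs a
warning, reconnects, and retries the same request after a fixed backoff, with
nothing that ever abandons the request once the destination broker (here, the
shutting-down broker itself) has closed its socket server.
{code}
import java.nio.channels.ClosedChannelException

// Paraphrased sketch of the retry behavior observed above -- NOT the actual
// kafka.controller.RequestSendThread code. The key property: a failed send
// only leads to "reconnect and retry the same request"; nothing ever drops
// the request, so once broker 45's socket server is closed the loop cannot
// make progress.
object SendRetrySketch {
  def sendWithRetry(send: () => Unit, reconnect: () => Unit): Unit = {
    var sent = false
    while (!sent) {                // no shutdown/abandon condition here
      try {
        send()                     // BlockingChannel.send(...) in the real thread
        sent = true
      } catch {
        case _: ClosedChannelException =>
          // "Controller 45 epoch 21 fails to send request ... Reconnecting to broker."
          reconnect()              // then "connected to id:45 ... for sending state change requests"
          Thread.sleep(300)        // matches the ~300 ms cadence in the log
      }
    }
  }
}
{code}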
Interspersed, there are floods of log messages like the following, repeated
seemingly for every partition on the broker (the broker has about 450
partitions). It's trying to shrink the ISR down to just the current broker
(EVEN THOUGH IT JUST COMPLETED CONTROLLED SHUTDOWN for all partitions). The
sequence runs in bursts of about 500ms, once for each partition on the box,
and then recurs every 10 seconds:
{code}
2015-04-14 05:56:23,716 INFO [kafka-scheduler-0] cluster.Partition - Partition
[mytopic2,0] on broker 45: Shrinking ISR for partition [mytopic2,0] from 46,45
to 45
2015-04-14 05:56:23,720 INFO [kafka-scheduler-0] cluster.Partition - Partition
[mytopic2,0] on broker 45: Cached zkVersion [19] not equal to that in
zookeeper, skip updating ISR
2015-04-14 05:56:23,720 INFO [kafka-scheduler-0] cluster.Partition - Partition
[mytopic3,13] on broker 45: Shrinking ISR for partition [mytopic3,13] from
45,48 to 45
2015-04-14 05:56:23,723 INFO [kafka-scheduler-0] cluster.Partition - Partition
[mytopic3,13] on broker 45: Cached zkVersion [8] not equal to that in
zookeeper, skip updating ISR
2015-04-14 05:56:23,723 INFO [kafka-scheduler-0] cluster.Partition - Partition
[mytopic4,12] on broker 45: Shrinking ISR for partition [mytopic4,12] from
46,45 to 45
2015-04-14 05:56:23,726 INFO [kafka-scheduler-0] cluster.Partition - Partition
[mytopic4,12] on broker 45: Cached zkVersion [19] not equal to that in
zookeeper, skip updating ISR
{code}
The above sequence is repeated every 10 seconds, it seems.
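The "Cached zkVersion ... not equal to that in zookeeper" messages look like a
version-conditioned ZooKeeper write that can never succeed: the broker still
thinks it leads these partitions, but leadership (and the ISR znode) has
already moved on, so its cached znode version is permanently stale. Roughly
this shape (again a paraphrase, not the actual kafka.cluster.Partition code):
{code}
// Paraphrased sketch of the ISR-shrink behavior observed above -- NOT the
// actual kafka.cluster.Partition / ReplicaManager code. Every ~10s the
// scheduler tries to shrink the ISR to just the local broker via a
// version-conditioned ZooKeeper write; the write is rejected because the
// cached version is stale, the cache is never refreshed, and the identical
// attempt repeats on the next run.
object IsrShrinkSketch {
  case class IsrState(isr: Set[Int], zkVersion: Int)

  def maybeShrinkIsr(localBrokerId: Int,
                     cached: IsrState,
                     conditionalWrite: (Set[Int], Int) => Option[Int]): IsrState = {
    // "Shrinking ISR for partition [mytopic2,0] from 46,45 to 45"
    val newIsr = Set(localBrokerId)
    conditionalWrite(newIsr, cached.zkVersion) match {
      case Some(newVersion) => IsrState(newIsr, newVersion)  // write succeeded
      case None =>
        // "Cached zkVersion [19] not equal to that in zookeeper, skip updating ISR"
        cached                                               // stale cache kept; next run retries identically
    }
  }
}
{code}
Since a broker that just completed controlled shutdown should no longer
consider itself leader of any of these partitions, the fact that the
ISR-shrink path is still running for ~450 partitions also points at leadership
state not being torn down during shutdown.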
So there appear to be two distinct non-terminating loops:
1. the Controller-45-to-broker-45-send-thread retrying every ~300ms
2. the kafka-scheduler thread attempting to shrink the ISR for every partition
on the box, every 10 seconds.
It seems clear that both would have recurred indefinitely had the app not been
killed hard.