[ https://issues.apache.org/jira/browse/KAFKA-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496865#comment-14496865 ]
Joel Koshy commented on KAFKA-2125:
-----------------------------------

For some reason, the shutdown of the socket server's processors got wedged, which prevented the controller, replica manager, etc. from shutting down. This fully explains the repetitive logs. A full thread dump would help in understanding why the socket server shutdown wedged - do you have the full dump? Also, do you have the full stack trace for the line that repeats 500 times in your attached grep output?
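For what it's worth, a full thread dump is usually captured with jstack <pid> or by sending the broker JVM a SIGQUIT (kill -3 <pid>). The sketch below is an illustrative, non-Kafka Scala snippet that prints the equivalent information from inside the process using the standard Thread.getAllStackTraces API:

{code}
import scala.collection.JavaConverters._

// Print every live thread's name, state, and stack trace -- the same
// information a jstack-style thread dump would show for a wedged broker.
object ThreadDump {
  def main(args: Array[String]): Unit = {
    Thread.getAllStackTraces.asScala.foreach { case (thread, frames) =>
      println(s"${thread.getName} (${thread.getState})")
      frames.foreach(frame => println(s"    at $frame"))
      println()
    }
  }
}
{code}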
> Infinite loop after controlled shutdown succeeds
> ------------------------------------------------
>
>                  Key: KAFKA-2125
>                  URL: https://issues.apache.org/jira/browse/KAFKA-2125
>              Project: Kafka
>           Issue Type: Bug
>     Affects Versions: 0.8.2.1
>             Reporter: Jason Rosenberg
>             Priority: Blocker
>          Attachments: grep_shut_edited.log
>
>
> I have a 4 node cluster, running 0.8.2.1, that got into a bad state last night during a rolling restart. The first node to be restarted was the controller. Controlled shutdown completed successfully, after about 800 ms. But after that, the server failed to shut down, and got into what appears to be an infinite cycle involving the 'Controller-45-to-broker-45-send-thread' and the 'kafka-scheduler-0' thread. Ultimately, the shutdown timed out (after 3 minutes), it was terminated by the deployment system, and the server was restarted. As expected, when it came back up it took some time re-syncing its partitions, but eventually came back and all was well.
> However, I think there was something fundamentally wrong with the shutdown process. The controller didn't seem to give up controller status as part of the controlled shutdown, for one thing (which I should think would be the first thing it should do?).
> Anyway, below are some log snippets. I do have full logs from each broker in the cluster, which I can provide (but I would have to significantly anonymize the logs before forwarding them along).
> Controlled shutdown starts and succeeds:
> {code}
> 2015-04-14 05:56:10,134 INFO [Thread-38] server.KafkaServer - [Kafka Server 45], shutting down
> 2015-04-14 05:56:10,136 INFO [Thread-38] server.KafkaServer - [Kafka Server 45], Starting controlled shutdown
> 2015-04-14 05:56:10,150 INFO [kafka-request-handler-0] controller.KafkaController - [Controller 45]: Shutting down broker 45
> ...
> ...
> 2015-04-14 05:56:10,951 INFO [Thread-38] server.KafkaServer - [Kafka Server 45], Controlled shutdown succeeded
> {code}
> Subsequently, the 'Controller-45-to-broker-45-send-thread' starts repeatedly spamming the logs, like so:
> {code}
> 2015-04-14 05:56:11,281 WARN [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 epoch 21 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1) -> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48) to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
> java.nio.channels.ClosedChannelException
>         at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
>         at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
>         at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> 2015-04-14 05:56:11,281 INFO [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 connected to id:45,host:broker45.square,port:12345 for sending state change requests
> 2015-04-14 05:56:11,582 WARN [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 epoch 21 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1) -> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48) to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
> java.nio.channels.ClosedChannelException
>         at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
>         at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
>         at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> 2015-04-14 05:56:11,582 INFO [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 connected to id:45,host:broker45.square,port:12345 for sending state change requests
> {code}
> This seems to repeat every 300 ms or so, and continues until the server is forcibly shut down (for about 3 minutes in this case). The partition being logged about (anonymized here as 'mytopic,1') is always the same topic/partition in these repeated log messages, so perhaps that one partition got into a bad state.
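The ~300 ms cadence is consistent with the shape of the loop implied by that stack trace (RequestSendThread.doWork retrying BlockingChannel.send): the thread takes a request off its queue once and then keeps re-sending it until it succeeds or the thread itself is shut down. A simplified, self-contained sketch of that pattern follows; it is not the actual 0.8.2 source, and the names and the 300 ms back-off are illustrative:

{code}
import java.nio.channels.ClosedChannelException
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicBoolean

// Simplified model of a controller-to-broker send thread: one request is taken
// from the queue and re-sent until it succeeds or the thread is shut down. If
// the target broker's socket server has already torn down the connection,
// send() keeps throwing ClosedChannelException, so the loop reconnects, backs
// off ~300 ms, and retries the same request indefinitely.
class RequestSendLoop(queue: LinkedBlockingQueue[String],
                      send: String => Unit,      // throws if the channel is closed
                      reconnect: () => Unit) extends Runnable {
  private val isRunning = new AtomicBoolean(true)

  override def run(): Unit = {
    while (isRunning.get) {
      val request = queue.take()                 // block for the next request
      var sent = false
      while (isRunning.get && !sent) {
        try {
          send(request)
          sent = true
        } catch {
          case _: ClosedChannelException =>
            // WARN "... fails to send request ... Reconnecting to broker."
            reconnect()                          // INFO "... connected to id:45 ..."
            Thread.sleep(300)                    // back off, then retry the same request
        }
      }
    }
  }

  // Nothing other than this flag breaks the retry loop, so if shutdown never
  // reaches the thread, the WARN/INFO pair repeats until the process is killed.
  def shutdown(): Unit = isRunning.set(false)
}
{code}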
> Interspersed, there are floods of log messages like these, repeated seemingly for every partition on the broker (the broker has about 450 partitions). It's trying to shrink the ISR to just the current broker (EVEN THOUGH IT JUST COMPLETED CONTROLLED SHUTDOWN for all partitions). The following sequence goes on in bursts of about 500 ms, once for each partition on the box, and then recurs every 10 seconds:
> {code}
> 2015-04-14 05:56:23,716 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic2,0] on broker 45: Shrinking ISR for partition [mytopic2,0] from 46,45 to 45
> 2015-04-14 05:56:23,720 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic2,0] on broker 45: Cached zkVersion [19] not equal to that in zookeeper, skip updating ISR
> 2015-04-14 05:56:23,720 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic3,13] on broker 45: Shrinking ISR for partition [mytopic3,13] from 45,48 to 45
> 2015-04-14 05:56:23,723 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic3,13] on broker 45: Cached zkVersion [8] not equal to that in zookeeper, skip updating ISR
> 2015-04-14 05:56:23,723 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic4,12] on broker 45: Shrinking ISR for partition [mytopic4,12] from 46,45 to 45
> 2015-04-14 05:56:23,726 INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic4,12] on broker 45: Cached zkVersion [19] not equal to that in zookeeper, skip updating ISR
> {code}
> The above sequence is repeated every 10 seconds, it seems.
> So there appear to be two different repeating, non-terminating loops:
> 1. the Controller-45-to-broker-45-send-thread, repeating every 300 ms
> 2. the kafka-scheduler thread attempting to shrink the ISR for every partition on the box, every 10 seconds
> It seems clear that this sequence would have recurred indefinitely had the app not been killed hard.
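For readers puzzling over the "Cached zkVersion [...] not equal to that in zookeeper, skip updating ISR" lines: the ISR shrink is performed as a conditional (compare-and-set) write of the partition state in ZooKeeper, guarded by a zkVersion the broker cached earlier. If the znode has since been rewritten (here, presumably by the controller while moving leadership during controlled shutdown), the conditional write is rejected, the broker keeps its stale cached state, and the next scheduled ISR check repeats the attempt, which lines up with the 10-second cadence above. A minimal sketch of that check, using illustrative names rather than Kafka's actual classes:

{code}
// Illustrative sketch only -- PartitionState, conditionalUpdate, etc. are
// made-up names, not Kafka's actual classes.
case class PartitionState(leader: Int, isr: Set[Int], zkVersion: Int)

// conditionalUpdate(data, expectedVersion) writes the znode only if it is still
// at expectedVersion and returns (succeeded, newVersion) -- the same contract as
// a ZooKeeper setData(path, data, version) call.
def maybeShrinkIsr(cached: PartitionState,
                   newIsr: Set[Int],
                   conditionalUpdate: (String, Int) => (Boolean, Int)): PartitionState = {
  val data = s"leader=${cached.leader},isr=${newIsr.mkString(",")}"
  val (updated, newVersion) = conditionalUpdate(data, cached.zkVersion)
  if (updated) {
    // Normal case: ISR shrunk, and the new zkVersion is cached for the next update.
    cached.copy(isr = newIsr, zkVersion = newVersion)
  } else {
    // Someone else changed the znode (here: the controller moving leadership), so
    // the cached zkVersion is stale: log "skip updating ISR" and keep the old state.
    // Because the broker still believes it leads the partition, the next scheduled
    // ISR check repeats the same attempt -- hence the 10-second loop.
    cached
  }
}
{code}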