[ 
https://issues.apache.org/jira/browse/KAFKA-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496883#comment-14496883
 ] 

Jason Rosenberg commented on KAFKA-2125:
----------------------------------------

Joel, unfortunately I have no stack trace, and the tack traces I listed above 
are the full stack traces that I have......Going forward, we've removed the 3 
minute automatic jvm shutdown, so next time, if it spins out of control, I can 
manually pull a thread dump.

> Infinite loop after controlled shutdown succeeds
> ------------------------------------------------
>
>                 Key: KAFKA-2125
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2125
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Jason Rosenberg
>            Priority: Blocker
>         Attachments: grep_shut_edited.log
>
>
> I have a 4 node cluster, running 0.8.2.1, that got into a bad state last 
> night during a rolling restart.  The first node to be restarted was the 
> controller.  Controlled Shutdown completed successfully, after about 800ms.  
> But after that, the server failed to shutdown, and got into what appears to 
> be an infinite cycle, involving the 'Controller-45-to-broker-45-send-thread', 
> and the 'kafka-scheduler-0' thread.  Ultimately, the shutdown timed out 
> (after 3 minutes) and it was terminated by the deployment system, and the 
> server was restarted.  As expected, when it came back up it took some time 
> re-syncing it's partitions, but eventually came back and all was well.  
> However, I think there was something fundamentally wrong with the shutdown 
> process.  The controller didn't seem to give up controller status, for one 
> thing, as part of the controlled shutdown (which I should think would be the 
> first thing it should do?).
> Anyway, below are some log snippets. I do have full logs from each broker in 
> the cluster, which I can provide (but would have to significantly anonymize 
> the logs before forwarding along).
> Controlled shutdown starts and succeeds:
> {code}
> 2015-04-14 05:56:10,134  INFO [Thread-38] server.KafkaServer - [Kafka Server 
> 45], shutting down
> 2015-04-14 05:56:10,136  INFO [Thread-38] server.KafkaServer - [Kafka Server 
> 45], Starting controlled shutdown
> 2015-04-14 05:56:10,150  INFO [kafka-request-handler-0] 
> controller.KafkaController - [Controller 45]: Shutting down broker 45
> ...
> ...
> 2015-04-14 05:56:10,951  INFO [Thread-38] server.KafkaServer - [Kafka Server 
> 45], Controlled shutdown succeeded
> {code}
> Subsequently, the 'Controller-45-to-broker-45-send-thread' starts repeatedly 
> spamming the logs, like so:
> {code}
> 2015-04-14 05:56:11,281  WARN [Controller-45-to-broker-45-send-thread] 
> controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], 
> Controller 45 epoch 21 fails to send request 
> Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1)
>  -> 
> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48)
>  to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
> java.nio.channels.ClosedChannelException
>         at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
>         at 
> kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
>         at 
> kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> 2015-04-14 05:56:11,281  INFO [Controller-45-to-broker-45-send-thread] 
> controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], 
> Controller 45 connected to id:45,host:broker45.square,port:12345 for sending 
> state change requests
> 2015-04-14 05:56:11,582  WARN [Controller-45-to-broker-45-send-thread] 
> controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], 
> Controller 45 epoch 21 fails to send request 
> Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1)
>  -> 
> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48)
>  to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
> java.nio.channels.ClosedChannelException
>         at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
>         at 
> kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
>         at 
> kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> 2015-04-14 05:56:11,582  INFO [Controller-45-to-broker-45-send-thread] 
> controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], 
> Controller 45 connected to id:45,host:broker45.square,port:12345 for sending 
> state change requests
> {code}
> This seems to repeat every 300 ms or so, and continues until the server is 
> forcibly shutdown (for about 3 minutes in this case).  The partition being 
> logged about there (anonymized as 'mytopic,1') is always the same in 
> topic/partition in these repeated log messages. So perhaps that one partition 
> got in a bad state.
> Interspersed, there are floods of error logs like these, which is repeated 
> seemingly for every partition on the broker (the broker has about 450 
> partitions).  It's trying to shrink the ISR to just having the current broker 
> (EVEN THOUGH IT JUST COMPLETED CONTROLLED SHUTDOWN for all partitions). The 
> following sequence goes on in bursts of 500ms, once for each partition on the 
> box (and the recurring every 10 seconds):
> {code}
> 2015-04-14 05:56:23,716  INFO [kafka-scheduler-0] cluster.Partition - 
> Partition [mytopic2,0] on broker 45: Shrinking ISR for partition [mytopic2,0] 
> from 46,45 to 45
> 2015-04-14 05:56:23,720  INFO [kafka-scheduler-0] cluster.Partition - 
> Partition [mytopic2,0] on broker 45: Cached zkVersion [19] not equal to that 
> in zookeeper, skip updating ISR
> 2015-04-14 05:56:23,720  INFO [kafka-scheduler-0] cluster.Partition - 
> Partition [mytopic3,13] on broker 45: Shrinking ISR for partition 
> [mytopic3,13] from 45,48 to 45
> 2015-04-14 05:56:23,723  INFO [kafka-scheduler-0] cluster.Partition - 
> Partition [mytopic3,13] on broker 45: Cached zkVersion [8] not equal to that 
> in zookeeper, skip updating ISR
> 2015-04-14 05:56:23,723  INFO [kafka-scheduler-0] cluster.Partition - 
> Partition [mytopic4,12] on broker 45: Shrinking ISR for partition 
> [mytopic4,12] from 46,45 to 45
> 2015-04-14 05:56:23,726  INFO [kafka-scheduler-0] cluster.Partition - 
> Partition [mytopic4,12] on broker 45: Cached zkVersion [19] not equal to that 
> in zookeeper, skip updating ISR
> {code}
> The above sequence is repeated every 10 seconds, it seems.
> So there appear to be 2 different repeating non-terminating loops:
> 1. the Controller-45-to-broker-45-send-thread repeating every 300ms
> 2. the kafka-scheduler thread attempting to shrink the ISR for every 
> partition the box, every 10 seconds.
> It seems clear that this sequence would recur indefinitely had the app not 
> been killed hard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to