[ 
https://issues.apache.org/jira/browse/KAFKA-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Chiang resolved KAFKA-3577.
-------------------------------
    Resolution: Duplicate

> Partial cluster breakdown
> -------------------------
>
>                 Key: KAFKA-3577
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3577
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.9.0.1
>         Environment: Debian GNU/Linux 7.9 (wheezy)
>            Reporter: Kim Christensen
>            Priority: Major
>
> We run a cluster of 3 brokers and 3 zookeepers, but after we upgraded to 
> 0.9.0.1 our cluster sometimes goes partially down, and we can't figure why. A 
> full cluster restart fixed the problem.
> I've added a snippet of the logs on each broker below.
> Broker 4:
> {quote}
> [2016-04-18 05:58:26,390] INFO [Group Metadata Manager on Broker 4]: Removed 
> 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
> [2016-04-18 06:05:55,218] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:57,396] ERROR Session has expired while creating 
> /controller (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:57,396] INFO Result of znode creation is: SESSIONEXPIRED 
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:57,400] ERROR Error while electing or becoming leader on 
> broker 4 (kafka.server.ZookeeperLeaderElector)
> org.I0Itec.zkclient.exception.ZkException: 
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired
>         at 
> org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
>         at kafka.utils.ZKCheckedEphemeral.create(ZkUtils.scala:1090)
>         at 
> kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:81)
>         at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
>         at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
>         at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
>         at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>         at 
> kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
>         at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:823)
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: 
> KeeperErrorCode = Session expired
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>         ... 9 more
> [2016-04-18 06:05:57,420] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:57,424] INFO Result of znode creation is: OK 
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:57,425] INFO 4 successfully elected as leader 
> (kafka.server.ZookeeperLeaderElector)
> [2016-04-18 06:05:57,885] INFO [ReplicaFetcherManager on broker 4] Removed 
> fetcher for partitions 
> [__consumer_offsets,32],[__consumer_offsets,44],[cdrrecords-errors,1],[cdrrecords,0],[__consumer_offsets,38],[__consumer_offsets,8],[events
> ,2],[__consumer_offsets,20],[__consumer_offsets,2],[__consumer_offsets,14],[__consumer_offsets,26]
>  (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:57,892] INFO [ReplicaFetcherManager on broker 4] Removed 
> fetcher for partitions 
> [__consumer_offsets,35],[__consumer_offsets,23],[__consumer_offsets,47],[__consumer_offsets,11],[__consumer_offsets,5],[events-errors,2],[_
> _consumer_offsets,17],[__consumer_offsets,41],[__consumer_offsets,29] 
> (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-17 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-23 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-29 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-35 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-41 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,895] INFO Truncating log events-errors-2 to offset 0. 
> (kafka.log.Log)
> [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-5 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-11 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,896] INFO Truncating log __consumer_offsets-47 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:57,904] INFO [ReplicaFetcherManager on broker 4] Added 
> fetcher for partitions List([[__consumer_offsets,17], initOffset 0 to broker 
> BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,23], 
> initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , 
> [[__consumer_offsets,29], initOffset 0 to broker 
> BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,35], 
> initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , 
> [[__consumer_offsets,41], initOffset 0 to broker 
> BrokerEndPoint(5,kafka304.prod.local,9092)] , [[events-errors,2], initOffset 
> 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , 
> [[__consumer_offsets,5], initOffset 0 to broker 
> BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,11], 
> initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , 
> [[__consumer_offsets,47], initOffset 0 to broker 
> BrokerEndPoint(5,kafka304.prod.local,9092)] ) 
> (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:57,908] INFO [ReplicaFetcherThread-0-6], Shutting down 
> (kafka.server.ReplicaFetcherThread)
> [2016-04-18 06:05:58,783] INFO re-registering broker info in ZK for broker 4 
> (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:05:58,784] INFO Creating /brokers/ids/4 (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:58,799] INFO Result of znode creation is: OK 
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:58,799] INFO Registered broker 4 at path /brokers/ids/4 
> with addresses: PLAINTEXT -> EndPoint(kafka303.prod.local,9092,PLAINTEXT) 
> (kafka.utils.ZkUtils)
> [2016-04-18 06:05:58,799] INFO done re-registering broker 
> (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:05:58,799] INFO Subscribing to /brokers/topics path to watch 
> for new topics (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:05:58,804] INFO New leader is 4 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2016-04-18 06:05:58,807] INFO New leader is 4 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2016-04-18 06:06:03,895] INFO [ReplicaFetcherThread-0-6], Stopped  
> (kafka.server.ReplicaFetcherThread)
> [2016-04-18 06:06:03,895] INFO [ReplicaFetcherThread-0-6], Shutdown completed 
> (kafka.server.ReplicaFetcherThread)
> {quote}
> Broker 5:
> {quote}
> [2016-04-18 05:57:05,926] INFO [Group Metadata Manager on Broker 5]: Removed 
> 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
> [2016-04-18 06:05:56,304] INFO Creating /controller (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:56,309] INFO Result of znode creation is: OK 
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:05:56,312] INFO 5 successfully elected as leader 
> (kafka.server.ZookeeperLeaderElector)
> [2016-04-18 06:05:58,545] INFO [ReplicaFetcherManager on broker 5] Removed 
> fetcher for partitions 
> [__consumer_offsets,17],[__consumer_offsets,23],[__consumer_offsets,29],[__consumer_offsets,35],[__consumer_offsets,41],[events-errors,2],[
> __consumer_offsets,5],[__consumer_offsets,11],[__consumer_offsets,47] 
> (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:58,554] INFO [ReplicaFetcherManager on broker 5] Removed 
> fetcher for partitions 
> [__consumer_offsets,2],[__consumer_offsets,38],[__consumer_offsets,44],[events,2],[__consumer_offsets,32],[cdrrecords-errors,1],[__consumer
> _offsets,8],[__consumer_offsets,20],[cdrrecords,0],[__consumer_offsets,14],[__consumer_offsets,26]
>  (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:58,556] INFO Truncating log __consumer_offsets-32 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,556] INFO Truncating log __consumer_offsets-44 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,556] INFO Truncating log cdrrecords-errors-1 to offset 
> 8. (kafka.log.Log)
> [2016-04-18 06:05:58,556] INFO Truncating log cdrrecords-0 to offset 172892. 
> (kafka.log.Log)
> [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-38 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-8 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,557] INFO Truncating log events-2 to offset 29954. 
> (kafka.log.Log)
> [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-20 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-2 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-14 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-26 to offset 
> 0. (kafka.log.Log)
> [2016-04-18 06:05:58,565] INFO [ReplicaFetcherManager on broker 5] Added 
> fetcher for partitions List([[__consumer_offsets,32], initOffset 0 to broker 
> BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,44], 
> initOffset 0 to
>  broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[cdrrecords-errors,1], 
> initOffset 8 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , 
> [[cdrrecords,0], initOffset 172892 to broker 
> BrokerEndPoint(4,kafka303.prod.local,9092)] ,
>  [[__consumer_offsets,38], initOffset 0 to broker 
> BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,8], 
> initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , 
> [[events,2], initOffset 29954 to broker BrokerE
> ndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,20], initOffset 0 
> to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , 
> [[__consumer_offsets,2], initOffset 0 to broker 
> BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consu
> mer_offsets,14], initOffset 0 to broker 
> BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,26], 
> initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] ) 
> (kafka.server.ReplicaFetcherManager)
> [2016-04-18 06:05:58,568] INFO [ReplicaFetcherThread-0-6], Shutting down 
> (kafka.server.ReplicaFetcherThread)
> [2016-04-18 06:05:58,613] INFO New leader is 4 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2016-04-18 06:06:04,571] INFO [ReplicaFetcherThread-0-6], Stopped  
> (kafka.server.ReplicaFetcherThread)
> [2016-04-18 06:06:04,571] INFO [ReplicaFetcherThread-0-6], Shutdown completed 
> (kafka.server.ReplicaFetcherThread)
> {quote}
> Broker 6:
> {quote}
> [2016-04-18 05:57:12,699] INFO [Group Metadata Manager on Broker 6]: Removed 
> 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
> [2016-04-18 06:05:55,281] INFO Partition [__consumer_offsets,47] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,47] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:03,656] INFO re-registering broker info in ZK for broker 6 
> (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:06:03,691] INFO Creating /brokers/ids/6 (is it secure? false) 
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:06:03,696] INFO Result of znode creation is: OK 
> (kafka.utils.ZKCheckedEphemeral)
> [2016-04-18 06:06:03,696] INFO Partition [__consumer_offsets,47] on broker 6: 
> Cached zkVersion [90] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:03,697] INFO Registered broker 6 at path /brokers/ids/6 
> with addresses: PLAINTEXT -> EndPoint(kafka305.prod.local,9092,PLAINTEXT) 
> (kafka.utils.ZkUtils)
> [2016-04-18 06:06:03,698] INFO done re-registering broker 
> (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:06:03,698] INFO Partition [__consumer_offsets,26] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,26] from 5,6,4 to 6,4 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:03,699] INFO Subscribing to /brokers/topics path to watch 
> for new topics (kafka.server.KafkaHealthcheck)
> [2016-04-18 06:06:03,709] INFO Partition [__consumer_offsets,26] on broker 6: 
> Cached zkVersion [88] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:03,821] INFO New leader is 4 
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2016-04-18 06:06:12,390] INFO Partition [__consumer_offsets,47] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,47] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,406] INFO Partition [__consumer_offsets,47] on broker 6: 
> Cached zkVersion [90] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,410] INFO Partition [__consumer_offsets,41] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,41] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,421] INFO Partition [__consumer_offsets,41] on broker 6: 
> Cached zkVersion [80] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,422] INFO Partition [cdrrecords,0] on broker 6: 
> Shrinking ISR for partition [cdrrecords,0] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,432] INFO Partition [cdrrecords,0] on broker 6: Cached 
> zkVersion [170] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,432] INFO Partition [__consumer_offsets,20] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,20] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,442] INFO Partition [__consumer_offsets,20] on broker 6: 
> Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,442] INFO Partition [__consumer_offsets,32] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,32] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,452] INFO Partition [__consumer_offsets,32] on broker 6: 
> Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,452] INFO Partition [cdrrecords-errors,1] on broker 6: 
> Shrinking ISR for partition [cdrrecords-errors,1] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,462] INFO Partition [cdrrecords-errors,1] on broker 6: 
> Cached zkVersion [135] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,462] INFO Partition [__consumer_offsets,38] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,38] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,503] INFO Partition [__consumer_offsets,38] on broker 6: 
> Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,503] INFO Partition [__consumer_offsets,5] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,5] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,515] INFO Partition [__consumer_offsets,5] on broker 6: 
> Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,515] INFO Partition [__consumer_offsets,23] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,23] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,526] INFO Partition [__consumer_offsets,23] on broker 6: 
> Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,526] INFO Partition [events-errors,2] on broker 6: 
> Shrinking ISR for partition [events-errors,2] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,536] INFO Partition [events-errors,2] on broker 6: 
> Cached zkVersion [167] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,537] INFO Partition [events,2] on broker 6: Shrinking 
> ISR for partition [events,2] from 5,6,4 to 6 (kafka.cluster.Partition)
> [2016-04-18 06:06:12,546] INFO Partition [events,2] on broker 6: Cached 
> zkVersion [237] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,547] INFO Partition [__consumer_offsets,8] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,8] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,557] INFO Partition [__consumer_offsets,8] on broker 6: 
> Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,557] INFO Partition [__consumer_offsets,44] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,44] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,567] INFO Partition [__consumer_offsets,44] on broker 6: 
> Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)
> [2016-04-18 06:06:12,567] INFO Partition [__consumer_offsets,29] on broker 6: 
> Shrinking ISR for partition [__consumer_offsets,29] from 5,6,4 to 6 
> (kafka.cluster.Partition)
> {quote}
> _and this goes on until the broker is restarted_



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to