[ https://issues.apache.org/jira/browse/KAFKA-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ray Chiang resolved KAFKA-3577. ------------------------------- Resolution: Duplicate > Partial cluster breakdown > ------------------------- > > Key: KAFKA-3577 > URL: https://issues.apache.org/jira/browse/KAFKA-3577 > Project: Kafka > Issue Type: Bug > Affects Versions: 0.9.0.1 > Environment: Debian GNU/Linux 7.9 (wheezy) > Reporter: Kim Christensen > Priority: Major > > We run a cluster of 3 brokers and 3 zookeepers, but after we upgraded to > 0.9.0.1 our cluster sometimes goes partially down, and we can't figure why. A > full cluster restart fixed the problem. > I've added a snippet of the logs on each broker below. > Broker 4: > {quote} > [2016-04-18 05:58:26,390] INFO [Group Metadata Manager on Broker 4]: Removed > 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager) > [2016-04-18 06:05:55,218] INFO Creating /controller (is it secure? false) > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:57,396] ERROR Session has expired while creating > /controller (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:57,396] INFO Result of znode creation is: SESSIONEXPIRED > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:57,400] ERROR Error while electing or becoming leader on > broker 4 (kafka.server.ZookeeperLeaderElector) > org.I0Itec.zkclient.exception.ZkException: > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode > = Session expired > at > org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68) > at kafka.utils.ZKCheckedEphemeral.create(ZkUtils.scala:1090) > at > kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:81) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262) > at > kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141) > at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:823) > at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) > Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:127) > ... 9 more > [2016-04-18 06:05:57,420] INFO Creating /controller (is it secure? false) > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:57,424] INFO Result of znode creation is: OK > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:57,425] INFO 4 successfully elected as leader > (kafka.server.ZookeeperLeaderElector) > [2016-04-18 06:05:57,885] INFO [ReplicaFetcherManager on broker 4] Removed > fetcher for partitions > [__consumer_offsets,32],[__consumer_offsets,44],[cdrrecords-errors,1],[cdrrecords,0],[__consumer_offsets,38],[__consumer_offsets,8],[events > ,2],[__consumer_offsets,20],[__consumer_offsets,2],[__consumer_offsets,14],[__consumer_offsets,26] > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:57,892] INFO [ReplicaFetcherManager on broker 4] Removed > fetcher for partitions > [__consumer_offsets,35],[__consumer_offsets,23],[__consumer_offsets,47],[__consumer_offsets,11],[__consumer_offsets,5],[events-errors,2],[_ > _consumer_offsets,17],[__consumer_offsets,41],[__consumer_offsets,29] > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-17 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-23 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,894] INFO Truncating log __consumer_offsets-29 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-35 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-41 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,895] INFO Truncating log events-errors-2 to offset 0. > (kafka.log.Log) > [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-5 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,895] INFO Truncating log __consumer_offsets-11 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,896] INFO Truncating log __consumer_offsets-47 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:57,904] INFO [ReplicaFetcherManager on broker 4] Added > fetcher for partitions List([[__consumer_offsets,17], initOffset 0 to broker > BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,23], > initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , > [[__consumer_offsets,29], initOffset 0 to broker > BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,35], > initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , > [[__consumer_offsets,41], initOffset 0 to broker > BrokerEndPoint(5,kafka304.prod.local,9092)] , [[events-errors,2], initOffset > 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , > [[__consumer_offsets,5], initOffset 0 to broker > BrokerEndPoint(5,kafka304.prod.local,9092)] , [[__consumer_offsets,11], > initOffset 0 to broker BrokerEndPoint(5,kafka304.prod.local,9092)] , > [[__consumer_offsets,47], initOffset 0 to broker > BrokerEndPoint(5,kafka304.prod.local,9092)] ) > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:57,908] INFO [ReplicaFetcherThread-0-6], Shutting down > (kafka.server.ReplicaFetcherThread) > [2016-04-18 06:05:58,783] INFO re-registering broker info in ZK for broker 4 > (kafka.server.KafkaHealthcheck) > [2016-04-18 06:05:58,784] INFO Creating /brokers/ids/4 (is it secure? false) > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:58,799] INFO Result of znode creation is: OK > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:58,799] INFO Registered broker 4 at path /brokers/ids/4 > with addresses: PLAINTEXT -> EndPoint(kafka303.prod.local,9092,PLAINTEXT) > (kafka.utils.ZkUtils) > [2016-04-18 06:05:58,799] INFO done re-registering broker > (kafka.server.KafkaHealthcheck) > [2016-04-18 06:05:58,799] INFO Subscribing to /brokers/topics path to watch > for new topics (kafka.server.KafkaHealthcheck) > [2016-04-18 06:05:58,804] INFO New leader is 4 > (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) > [2016-04-18 06:05:58,807] INFO New leader is 4 > (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) > [2016-04-18 06:06:03,895] INFO [ReplicaFetcherThread-0-6], Stopped > (kafka.server.ReplicaFetcherThread) > [2016-04-18 06:06:03,895] INFO [ReplicaFetcherThread-0-6], Shutdown completed > (kafka.server.ReplicaFetcherThread) > {quote} > Broker 5: > {quote} > [2016-04-18 05:57:05,926] INFO [Group Metadata Manager on Broker 5]: Removed > 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager) > [2016-04-18 06:05:56,304] INFO Creating /controller (is it secure? false) > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:56,309] INFO Result of znode creation is: OK > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:05:56,312] INFO 5 successfully elected as leader > (kafka.server.ZookeeperLeaderElector) > [2016-04-18 06:05:58,545] INFO [ReplicaFetcherManager on broker 5] Removed > fetcher for partitions > [__consumer_offsets,17],[__consumer_offsets,23],[__consumer_offsets,29],[__consumer_offsets,35],[__consumer_offsets,41],[events-errors,2],[ > __consumer_offsets,5],[__consumer_offsets,11],[__consumer_offsets,47] > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:58,554] INFO [ReplicaFetcherManager on broker 5] Removed > fetcher for partitions > [__consumer_offsets,2],[__consumer_offsets,38],[__consumer_offsets,44],[events,2],[__consumer_offsets,32],[cdrrecords-errors,1],[__consumer > _offsets,8],[__consumer_offsets,20],[cdrrecords,0],[__consumer_offsets,14],[__consumer_offsets,26] > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:58,556] INFO Truncating log __consumer_offsets-32 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,556] INFO Truncating log __consumer_offsets-44 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,556] INFO Truncating log cdrrecords-errors-1 to offset > 8. (kafka.log.Log) > [2016-04-18 06:05:58,556] INFO Truncating log cdrrecords-0 to offset 172892. > (kafka.log.Log) > [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-38 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-8 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,557] INFO Truncating log events-2 to offset 29954. > (kafka.log.Log) > [2016-04-18 06:05:58,557] INFO Truncating log __consumer_offsets-20 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-2 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-14 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,558] INFO Truncating log __consumer_offsets-26 to offset > 0. (kafka.log.Log) > [2016-04-18 06:05:58,565] INFO [ReplicaFetcherManager on broker 5] Added > fetcher for partitions List([[__consumer_offsets,32], initOffset 0 to broker > BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,44], > initOffset 0 to > broker BrokerEndPoint(4,kafka303.prod.local,9092)] , [[cdrrecords-errors,1], > initOffset 8 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , > [[cdrrecords,0], initOffset 172892 to broker > BrokerEndPoint(4,kafka303.prod.local,9092)] , > [[__consumer_offsets,38], initOffset 0 to broker > BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,8], > initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , > [[events,2], initOffset 29954 to broker BrokerE > ndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,20], initOffset 0 > to broker BrokerEndPoint(4,kafka303.prod.local,9092)] , > [[__consumer_offsets,2], initOffset 0 to broker > BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consu > mer_offsets,14], initOffset 0 to broker > BrokerEndPoint(4,kafka303.prod.local,9092)] , [[__consumer_offsets,26], > initOffset 0 to broker BrokerEndPoint(4,kafka303.prod.local,9092)] ) > (kafka.server.ReplicaFetcherManager) > [2016-04-18 06:05:58,568] INFO [ReplicaFetcherThread-0-6], Shutting down > (kafka.server.ReplicaFetcherThread) > [2016-04-18 06:05:58,613] INFO New leader is 4 > (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) > [2016-04-18 06:06:04,571] INFO [ReplicaFetcherThread-0-6], Stopped > (kafka.server.ReplicaFetcherThread) > [2016-04-18 06:06:04,571] INFO [ReplicaFetcherThread-0-6], Shutdown completed > (kafka.server.ReplicaFetcherThread) > {quote} > Broker 6: > {quote} > [2016-04-18 05:57:12,699] INFO [Group Metadata Manager on Broker 6]: Removed > 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager) > [2016-04-18 06:05:55,281] INFO Partition [__consumer_offsets,47] on broker 6: > Shrinking ISR for partition [__consumer_offsets,47] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:03,656] INFO re-registering broker info in ZK for broker 6 > (kafka.server.KafkaHealthcheck) > [2016-04-18 06:06:03,691] INFO Creating /brokers/ids/6 (is it secure? false) > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:06:03,696] INFO Result of znode creation is: OK > (kafka.utils.ZKCheckedEphemeral) > [2016-04-18 06:06:03,696] INFO Partition [__consumer_offsets,47] on broker 6: > Cached zkVersion [90] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:03,697] INFO Registered broker 6 at path /brokers/ids/6 > with addresses: PLAINTEXT -> EndPoint(kafka305.prod.local,9092,PLAINTEXT) > (kafka.utils.ZkUtils) > [2016-04-18 06:06:03,698] INFO done re-registering broker > (kafka.server.KafkaHealthcheck) > [2016-04-18 06:06:03,698] INFO Partition [__consumer_offsets,26] on broker 6: > Shrinking ISR for partition [__consumer_offsets,26] from 5,6,4 to 6,4 > (kafka.cluster.Partition) > [2016-04-18 06:06:03,699] INFO Subscribing to /brokers/topics path to watch > for new topics (kafka.server.KafkaHealthcheck) > [2016-04-18 06:06:03,709] INFO Partition [__consumer_offsets,26] on broker 6: > Cached zkVersion [88] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:03,821] INFO New leader is 4 > (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) > [2016-04-18 06:06:12,390] INFO Partition [__consumer_offsets,47] on broker 6: > Shrinking ISR for partition [__consumer_offsets,47] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,406] INFO Partition [__consumer_offsets,47] on broker 6: > Cached zkVersion [90] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,410] INFO Partition [__consumer_offsets,41] on broker 6: > Shrinking ISR for partition [__consumer_offsets,41] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,421] INFO Partition [__consumer_offsets,41] on broker 6: > Cached zkVersion [80] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,422] INFO Partition [cdrrecords,0] on broker 6: > Shrinking ISR for partition [cdrrecords,0] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,432] INFO Partition [cdrrecords,0] on broker 6: Cached > zkVersion [170] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,432] INFO Partition [__consumer_offsets,20] on broker 6: > Shrinking ISR for partition [__consumer_offsets,20] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,442] INFO Partition [__consumer_offsets,20] on broker 6: > Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,442] INFO Partition [__consumer_offsets,32] on broker 6: > Shrinking ISR for partition [__consumer_offsets,32] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,452] INFO Partition [__consumer_offsets,32] on broker 6: > Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,452] INFO Partition [cdrrecords-errors,1] on broker 6: > Shrinking ISR for partition [cdrrecords-errors,1] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,462] INFO Partition [cdrrecords-errors,1] on broker 6: > Cached zkVersion [135] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,462] INFO Partition [__consumer_offsets,38] on broker 6: > Shrinking ISR for partition [__consumer_offsets,38] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,503] INFO Partition [__consumer_offsets,38] on broker 6: > Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,503] INFO Partition [__consumer_offsets,5] on broker 6: > Shrinking ISR for partition [__consumer_offsets,5] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,515] INFO Partition [__consumer_offsets,5] on broker 6: > Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,515] INFO Partition [__consumer_offsets,23] on broker 6: > Shrinking ISR for partition [__consumer_offsets,23] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,526] INFO Partition [__consumer_offsets,23] on broker 6: > Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,526] INFO Partition [events-errors,2] on broker 6: > Shrinking ISR for partition [events-errors,2] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,536] INFO Partition [events-errors,2] on broker 6: > Cached zkVersion [167] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,537] INFO Partition [events,2] on broker 6: Shrinking > ISR for partition [events,2] from 5,6,4 to 6 (kafka.cluster.Partition) > [2016-04-18 06:06:12,546] INFO Partition [events,2] on broker 6: Cached > zkVersion [237] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,547] INFO Partition [__consumer_offsets,8] on broker 6: > Shrinking ISR for partition [__consumer_offsets,8] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,557] INFO Partition [__consumer_offsets,8] on broker 6: > Cached zkVersion [82] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,557] INFO Partition [__consumer_offsets,44] on broker 6: > Shrinking ISR for partition [__consumer_offsets,44] from 5,6,4 to 6 > (kafka.cluster.Partition) > [2016-04-18 06:06:12,567] INFO Partition [__consumer_offsets,44] on broker 6: > Cached zkVersion [83] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition) > [2016-04-18 06:06:12,567] INFO Partition [__consumer_offsets,29] on broker 6: > Shrinking ISR for partition [__consumer_offsets,29] from 5,6,4 to 6 > (kafka.cluster.Partition) > {quote} > _and this goes on until the broker is restarted_ -- This message was sent by Atlassian JIRA (v7.6.3#76005)