[ https://issues.apache.org/jira/browse/KAFKA-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968622#comment-16968622 ]
Peter Bukowinski commented on KAFKA-9044: ----------------------------------------- I parsed the zk transaction log and there is deletion event that lines up with the timestamp from the controller's log. 11/6/19 3:24:46 AM PST session 0x36b4976857ae342 cxid 0x1 zxid 0xa0244538f delete '/kafka/brokers/ids/23 11/6/19 3:24:46 AM PST session 0x26b4976aa821783 cxid 0x7 zxid 0xa02445390 setData '/kafka/CruiseControlBrokerList,#32333d31353733303339343836313233,1467596 11/6/19 3:24:46 AM PST session 0x36b49768579d948 cxid 0x1f5 zxid 0xa024454b8 setData '/kafka/brokers/topics/ac/partitions/207/state,#7b22636f6e74726f6c6c65725f65706f6368223a31372c226c6561646572223a31382c2276657273696f6e223a312c226c65616465725f65706f6368223a31332c22697372223a5b31382c312c32335d7d,28 11/6/19 3:24:46 AM PST session 0x16c9250fa50af3e cxid 0x2c6 zxid 0xa024454b9 setData '/kafka/brokers/topics/Fe/partitions/199/state,#7b22636f6e74726f6c6c65725f65706f6368223a31372c226c6561646572223a322c2276657273696f6e223a312c226c65616465725f65706f6368223a31332c22697372223a5b322c312c32335d7d,28 ... > Brokers occasionally (randomly?) dropping out of clusters > --------------------------------------------------------- > > Key: KAFKA-9044 > URL: https://issues.apache.org/jira/browse/KAFKA-9044 > Project: Kafka > Issue Type: Bug > Components: core > Affects Versions: 2.3.0, 2.3.1 > Environment: Ubuntu 14.04 > Reporter: Peter Bukowinski > Priority: Major > > I have several cluster running kafka 2.3.1 and this issue has affected all of > them. Because of replication and the size of the clusters (30 brokers), this > bug is not causing any data loss, but it is nevertheless concerning. When a > broker drops out, the log gives no indication that there are any zookeeper > issues (and indeed the zookeepers are healthy when this occurs. Here's > snippet from a broker log when it occurs: > {{[2019-10-07 11:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 > expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Found deletable segments with base offsets [1975332] due to > retention time 3600000ms breach (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Scheduling log segment [baseOffset 1975332, size 92076008] > for deletion. (kafka.log.Log)}} > {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Incrementing log start offset to 2000317 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,936] INFO [Log partition=internal_test-52, > dir=/data/3/kl] Deleting segment 1975332 (kafka.log.Log)}} > {{[2019-10-07 11:03:56,957] INFO Deleted log > /data/3/kl/internal_test-52/00000000000001975332.log.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,957] INFO Deleted offset index > /data/3/kl/internal_test-52/00000000000001975332.index.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:03:56,958] INFO Deleted time index > /data/3/kl/internal_test-52/00000000000001975332.timeindex.deleted. > (kafka.log.LogSegment)}} > {{[2019-10-07 11:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 11:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:42:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 1 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 12:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:12:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 13:52:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 1 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 14:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 14:12:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 14:22:27,630] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 1 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 14:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 14:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed > 0 expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager)}} > {{[2019-10-07 14:46:07,510] INFO [Partition internal_test-33 broker=14] > Shrinking ISR from 16,17,14 to 14. Leader: (highWatermark: 2007553, > endOffset: 2007555). Out of sync replicas: (brokerId: 16, endOffset: 2007553) > (brokerId: 17, endOffset: 2007553). (kafka.cluster.Partition)}} > {{[2019-10-07 14:46:07,511] INFO [Partition internal_test-33 broker=14] > Cached zkVersion [17] not equal to that in zookeeper, skip updating ISR > (kafka.cluster.Partition)}} > The controller log shows the following: > {{[2019-10-07 14:45:55,427] INFO [Controller id=24] Newly added brokers: , > deleted brokers: 14, bounced brokers: , all live brokers: > 1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25 > (kafka.controller.KafkaController)}} > {{[2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Shutting > down (kafka.controller.RequestSendThread)}} > {{[2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Shutdown > completed (kafka.controller.RequestSendThread)}} > {{[2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Stopped > (kafka.controller.RequestSendThread)}} > {{[2019-10-07 14:45:55,481] INFO [Controller id=24] Broker failure callback > for 14 (kafka.controller.KafkaController)}} > The clusters use {{zookeeper.session.timeout.ms=45000}}, and > {{zookeeper.connection.timeout.ms=90000}}. > I'm unable to find a cause for this behavior. The only thing I can do to > resolve the issue is to restart the broker. -- This message was sent by Atlassian Jira (v8.3.4#803005)