[ https://issues.apache.org/jira/browse/KAFKA-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968593#comment-16968593 ]
Peter Bukowinski commented on KAFKA-9044:
-----------------------------------------

I had another drop-out today. Here is the controller's log, with context:

{{[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition ac-50 completed preferred replica leader election. New leader is 15 (kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition Sa-39 completed preferred replica leader election. New leader is 7 (kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition cr-58 completed preferred replica leader election. New leader is 27 (kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition Sa-53 completed preferred replica leader election. New leader is 26 (kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition Fe-105 completed preferred replica leader election. New leader is 25 (kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition En-29 completed preferred replica leader election. New leader is 9 (kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition Fe-130 completed preferred replica leader election. New leader is 23 (kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition Sa-57 completed preferred replica leader election. New leader is 17 (kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition En-73 completed preferred replica leader election. New leader is 28 (kafka.controller.KafkaController)
[2019-10-29 18:10:47,353] DEBUG [Controller id=29] PreferredReplicaLeaderElection waiting: Set(...redacted...), results: Map() (kafka.controller.KafkaController)
[2019-11-06 03:24:46,122] INFO [Controller id=29] Newly added brokers: , deleted brokers: 23, bounced brokers: , all live brokers: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,24,25,26,27,28,29,30 (kafka.controller.KafkaController)
[2019-11-06 03:24:46,200] INFO [RequestSendThread controllerId=29] Shutting down (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,200] INFO [RequestSendThread controllerId=29] Stopped (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,201] INFO [RequestSendThread controllerId=29] Shutdown completed (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,206] INFO [Controller id=29] Broker failure callback for 23 (kafka.controller.KafkaController)
[2019-11-06 03:24:46,272] INFO [RequestSendThread controllerId=29] Controller 29 connected to kafka0029.example.com:9092 (id: 29 rack: REDACTED) for sending state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,274] INFO [RequestSendThread controllerId=29] Controller 29 connected to kafka0005.example.com:9092 (id: 5 rack: REDACTED) for sending state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,274] INFO [RequestSendThread controllerId=29] Controller 29 connected to kafka0019.example.com:9092 (id: 19 rack: REDACTED) for sending state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,274] INFO [RequestSendThread controllerId=29] Controller 29 connected to kafka0010.example.com:9092 (id: 10 rack: REDACTED) for sending state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 29 connected to kafka0016.example.com:9092 (id: 16 rack: REDACTED) for sending state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 29 connected to kafka0007.example.com:9092 (id: 7 rack: REDACTED) for sending state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 29 connected to kafka0011.example.com:9092 (id: 11 rack: REDACTED) for sending state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 29 connected to kafka0001.example.com:9092 (id: 1 rack: REDACTED) for sending state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 29 connected to kafka0020.example.com:9092 (id: 20 rack: REDACTED) for sending state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 29 connected to kafka0018.example.com:9092 (id: 18 rack: REDACTED) for sending state change requests (kafka.controller.RequestSendThread)}}

Here is the broker log, with context:

{{[2019-11-06 03:16:11,811] INFO [Log partition=Fe-164, dir=/data/3/kl] Found deletable segments with base offsets [235676977] due to retention time 259200000ms breach (kafka.log.Log)
[2019-11-06 03:16:11,812] INFO [Log partition=Fe-164, dir=/data/3/kl] Scheduling log segment [baseOffset 235676977, size 1073605838] for deletion. (kafka.log.Log)
[2019-11-06 03:17:11,812] INFO [Log partition=Fe-164, dir=/data/3/kl] Deleting segment 235676977 (kafka.log.Log)
[2019-11-06 03:17:12,149] INFO Deleted log /data/3/kl/Fe-164/00000000000235676977.log.deleted. (kafka.log.LogSegment)
[2019-11-06 03:17:12,149] INFO Deleted offset index /data/3/kl/Fe-164/00000000000235676977.index.deleted. (kafka.log.LogSegment)
[2019-11-06 03:17:12,149] INFO Deleted time index /data/3/kl/Fe-164/00000000000235676977.timeindex.deleted. (kafka.log.LogSegment)
[2019-11-06 03:19:12,101] INFO [ProducerStateManager partition=Fe-163] Writing producer snapshot at offset 243728101 (kafka.log.ProducerStateManager)
[2019-11-06 03:19:12,102] INFO [Log partition=Fe-163, dir=/data/1/kl] Rolled new log segment at offset 243728101 in 1 ms. (kafka.log.Log)
[2019-11-06 03:19:43,422] INFO [ProducerStateManager partition=se-133] Writing producer snapshot at offset 25042454 (kafka.log.ProducerStateManager)
[2019-11-06 03:19:43,422] INFO [Log partition=se-133, dir=/data/1/kl] Rolled new log segment at offset 25042454 in 0 ms. (kafka.log.Log)
[2019-11-06 03:22:09,511] INFO [ProducerStateManager partition=se-169] Writing producer snapshot at offset 25037202 (kafka.log.ProducerStateManager)
[2019-11-06 03:22:09,511] INFO [Log partition=se-169, dir=/data/3/kl] Rolled new log segment at offset 25037202 in 0 ms. (kafka.log.Log)
[2019-11-06 03:24:57,452] INFO [Partition Fe-40 broker=23] Shrinking ISR from 16,17,23 to 23. Leader: (highWatermark: 243730381, endOffset: 243730650). Out of sync replicas: (brokerId: 16, endOffset: 243730381) (brokerId: 17, endOffset: 243730382). (kafka.cluster.Partition)
[2019-11-06 03:24:57,525] INFO [Partition Fe-40 broker=23] Cached zkVersion [27] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-11-06 03:24:57,526] INFO [Partition ac-450 broker=23] Shrinking ISR from 4,5,23 to 23. Leader: (highWatermark: 187730758, endOffset: 187731018). Out of sync replicas: (brokerId: 4, endOffset: 187730773) (brokerId: 5, endOffset: 187730758). (kafka.cluster.Partition)
[2019-11-06 03:24:57,528] INFO [Partition ac-450 broker=23] Cached zkVersion [27] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-11-06 03:24:57,529] INFO [Partition fb-46 broker=23] Shrinking ISR from 12,13,23 to 23. Leader: (highWatermark: 3543421, endOffset: 3543424). Out of sync replicas: (brokerId: 12, endOffset: 3543421) (brokerId: 13, endOffset: 3543421). (kafka.cluster.Partition)
[2019-11-06 03:24:57,531] INFO [Partition fb-46 broker=23] Cached zkVersion [27] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-11-06 03:24:57,532] INFO [Partition ac-120 broker=23] Shrinking ISR from 23,29,30 to 23. Leader: (highWatermark: 187738210, endOffset: 187738437). Out of sync replicas: (brokerId: 29, endOffset: 187738212) (brokerId: 30, endOffset: 187738210). (kafka.cluster.Partition)
[2019-11-06 03:24:57,535] INFO [Partition ac-120 broker=23] Cached zkVersion [25] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-11-06 03:24:57,535] INFO [Partition se-37 broker=23] Shrinking ISR from 22,23,24 to 23. Leader: (highWatermark: 25042854, endOffset: 25042891). Out of sync replicas: (brokerId: 22, endOffset: 25042854) (brokerId: 24, endOffset: 25042855). (kafka.cluster.Partition)
[2019-11-06 03:24:57,538] INFO [Partition ses-37 broker=23] Cached zkVersion [31] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)}}

The state-change log of the affected broker has no entries in this time period, and the broker's id is not listed in zookeeper. As you can see, everything looks normal and then the broker is kicked out of the cluster. This has occurred on 16 clusters in separate datacenters since I started tracking it. I'd like to stress that it did not occur before I upgraded from kafka 2.0.1.

> Brokers occasionally (randomly?) dropping out of clusters
> ---------------------------------------------------------
>
>                 Key: KAFKA-9044
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9044
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.3.0, 2.3.1
>         Environment: Ubuntu 14.04
>            Reporter: Peter Bukowinski
>            Priority: Major
>
> I have several clusters running kafka 2.3.1 and this issue has affected all of them.
> Because of replication and the size of the clusters (30 brokers), this bug is not causing any data loss, but it is nevertheless concerning. When a broker drops out, the log gives no indication that there are any zookeeper issues (and indeed the zookeepers are healthy when this occurs). Here's a snippet from a broker log when it occurs:
> {{[2019-10-07 11:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, dir=/data/3/kl] Found deletable segments with base offsets [1975332] due to retention time 3600000ms breach (kafka.log.Log)}}
> {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, dir=/data/3/kl] Scheduling log segment [baseOffset 1975332, size 92076008] for deletion. (kafka.log.Log)}}
> {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, dir=/data/3/kl] Incrementing log start offset to 2000317 (kafka.log.Log)}}
> {{[2019-10-07 11:03:56,936] INFO [Log partition=internal_test-52, dir=/data/3/kl] Deleting segment 1975332 (kafka.log.Log)}}
> {{[2019-10-07 11:03:56,957] INFO Deleted log /data/3/kl/internal_test-52/00000000000001975332.log.deleted. (kafka.log.LogSegment)}}
> {{[2019-10-07 11:03:56,957] INFO Deleted offset index /data/3/kl/internal_test-52/00000000000001975332.index.deleted. (kafka.log.LogSegment)}}
> {{[2019-10-07 11:03:56,958] INFO Deleted time index /data/3/kl/internal_test-52/00000000000001975332.timeindex.deleted. (kafka.log.LogSegment)}}
> {{[2019-10-07 11:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 11:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 11:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 11:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 11:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 12:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 12:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 12:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 12:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 12:42:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 12:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 13:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 13:12:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 13:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 13:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 13:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 13:52:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 14:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 14:12:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 14:22:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 14:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 14:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)}}
> {{[2019-10-07 14:46:07,510] INFO [Partition internal_test-33 broker=14] Shrinking ISR from 16,17,14 to 14. Leader: (highWatermark: 2007553, endOffset: 2007555). Out of sync replicas: (brokerId: 16, endOffset: 2007553) (brokerId: 17, endOffset: 2007553). (kafka.cluster.Partition)}}
> {{[2019-10-07 14:46:07,511] INFO [Partition internal_test-33 broker=14] Cached zkVersion [17] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)}}
> The controller log shows the following:
> {{[2019-10-07 14:45:55,427] INFO [Controller id=24] Newly added brokers: , deleted brokers: 14, bounced brokers: , all live brokers: 1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25 (kafka.controller.KafkaController)}}
> {{[2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Shutting down (kafka.controller.RequestSendThread)}}
> {{[2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Shutdown completed (kafka.controller.RequestSendThread)}}
> {{[2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Stopped (kafka.controller.RequestSendThread)}}
> {{[2019-10-07 14:45:55,481] INFO [Controller id=24] Broker failure callback for 14 (kafka.controller.KafkaController)}}
> The clusters use {{zookeeper.session.timeout.ms=45000}} and {{zookeeper.connection.timeout.ms=90000}}.
> I'm unable to find a cause for this behavior. The only thing I can do to resolve the issue is to restart the broker.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
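[Editorial note] Since the key symptom above is that the dropped broker's id disappears from ZooKeeper while the process keeps running, it may help to capture the registration state the moment a drop-out is detected. The sketch below assumes the `zookeeper-shell.sh` tool shipped with the Kafka distribution is on `PATH`, and uses a placeholder ensemble address (`localhost:2181`) that must be adjusted for the actual environment:

```shell
#!/bin/sh
# Inspect broker registration znodes in ZooKeeper.
# A broker that has "dropped out" is missing from /brokers/ids even
# though its process is still up and serving fetch/produce requests.

ZK="${ZK:-localhost:2181}"   # placeholder; point at a real ensemble member

# List the ids of all currently registered (live) brokers.
list_registered_brokers() {
  zookeeper-shell.sh "$ZK" ls /brokers/ids
}

# Show one broker's ephemeral registration (host, port, rack, timestamp).
# Usage: describe_broker 23
describe_broker() {
  zookeeper-shell.sh "$ZK" get "/brokers/ids/$1"
}
```

Comparing `list_registered_brokers` output against the expected broker set right after a "deleted brokers: N" controller log line would confirm whether the ephemeral znode expired (a session timeout) or was never re-created (a stuck session re-registration).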