[ https://issues.apache.org/jira/browse/KAFKA-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968593 ]

Peter Bukowinski commented on KAFKA-9044:
-----------------------------------------

I had another drop-out today. Here is the controller's log with context:

{{[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition ac-50 completed 
preferred replica leader election. New leader is 15 
(kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition Sa-39 completed 
preferred replica leader election. New leader is 7 
(kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition cr-58 completed 
preferred replica leader election. New leader is 27 
(kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition Sa-53 completed 
preferred replica leader election. New leader is 26 
(kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition Fe-105 completed 
preferred replica leader election. New leader is 25 
(kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition En-29 completed 
preferred replica leader election. New leader is 9 
(kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition Fe-130 completed 
preferred replica leader election. New leader is 23 
(kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition Sa-57 completed 
preferred replica leader election. New leader is 17 
(kafka.controller.KafkaController)
[2019-10-29 18:10:47,350] INFO [Controller id=29] Partition En-73 completed 
preferred replica leader election. New leader is 28 
(kafka.controller.KafkaController)
[2019-10-29 18:10:47,353] DEBUG [Controller id=29] 
PreferredReplicaLeaderElection waiting: Set(...redacted...), results: Map() 
(kafka.controller.KafkaController)
[2019-11-06 03:24:46,122] INFO [Controller id=29] Newly added brokers: , 
deleted brokers: 23, bounced brokers: , all live brokers: 
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,24,25,26,27,28,29,30 
(kafka.controller.KafkaController)
[2019-11-06 03:24:46,200] INFO [RequestSendThread controllerId=29] Shutting 
down (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,200] INFO [RequestSendThread controllerId=29] Stopped 
(kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,201] INFO [RequestSendThread controllerId=29] Shutdown 
completed (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,206] INFO [Controller id=29] Broker failure callback for 
23 (kafka.controller.KafkaController)
[2019-11-06 03:24:46,272] INFO [RequestSendThread controllerId=29] Controller 
29 connected to kafka0029.example.com:9092 (id: 29 rack: REDACTED) for sending 
state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,274] INFO [RequestSendThread controllerId=29] Controller 
29 connected to kafka0005.example.com:9092 (id: 5 rack: REDACTED) for sending 
state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,274] INFO [RequestSendThread controllerId=29] Controller 
29 connected to kafka0019.example.com:9092 (id: 19 rack: REDACTED) for sending 
state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,274] INFO [RequestSendThread controllerId=29] Controller 
29 connected to kafka0010.example.com:9092 (id: 10 rack: REDACTED) for sending 
state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 
29 connected to kafka0016.example.com:9092 (id: 16 rack: REDACTED) for sending 
state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 
29 connected to kafka0007.example.com:9092 (id: 7 rack: REDACTED) for sending 
state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 
29 connected to kafka0011.example.com:9092 (id: 11 rack: REDACTED) for sending 
state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 
29 connected to kafka0001.example.com:9092 (id: 1 rack: REDACTED) for sending 
state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 
29 connected to kafka0020.example.com:9092 (id: 20 rack: REDACTED) for sending 
state change requests (kafka.controller.RequestSendThread)
[2019-11-06 03:24:46,275] INFO [RequestSendThread controllerId=29] Controller 
29 connected to kafka0018.example.com:9092 (id: 18 rack: REDACTED) for sending 
state change requests (kafka.controller.RequestSendThread)}}

Here is the broker log, with context:

{{[2019-11-06 03:16:11,811] INFO [Log partition=Fe-164, dir=/data/3/kl] Found 
deletable segments with base offsets [235676977] due to retention time 
259200000ms breach (kafka.log.Log)
[2019-11-06 03:16:11,812] INFO [Log partition=Fe-164, dir=/data/3/kl] 
Scheduling log segment [baseOffset 235676977, size 1073605838] for deletion. 
(kafka.log.Log)
[2019-11-06 03:17:11,812] INFO [Log partition=Fe-164, dir=/data/3/kl] Deleting 
segment 235676977 (kafka.log.Log)
[2019-11-06 03:17:12,149] INFO Deleted log 
/data/3/kl/Fe-164/00000000000235676977.log.deleted. (kafka.log.LogSegment)
[2019-11-06 03:17:12,149] INFO Deleted offset index 
/data/3/kl/Fe-164/00000000000235676977.index.deleted. (kafka.log.LogSegment)
[2019-11-06 03:17:12,149] INFO Deleted time index 
/data/3/kl/Fe-164/00000000000235676977.timeindex.deleted. (kafka.log.LogSegment)
[2019-11-06 03:19:12,101] INFO [ProducerStateManager partition=Fe-163] Writing 
producer snapshot at offset 243728101 (kafka.log.ProducerStateManager)
[2019-11-06 03:19:12,102] INFO [Log partition=Fe-163, dir=/data/1/kl] Rolled 
new log segment at offset 243728101 in 1 ms. (kafka.log.Log)
[2019-11-06 03:19:43,422] INFO [ProducerStateManager partition=se-133] Writing 
producer snapshot at offset 25042454 (kafka.log.ProducerStateManager)
[2019-11-06 03:19:43,422] INFO [Log partition=se-133, dir=/data/1/kl] Rolled 
new log segment at offset 25042454 in 0 ms. (kafka.log.Log)
[2019-11-06 03:22:09,511] INFO [ProducerStateManager partition=se-169] Writing 
producer snapshot at offset 25037202 (kafka.log.ProducerStateManager)
[2019-11-06 03:22:09,511] INFO [Log partition=se-169, dir=/data/3/kl] Rolled 
new log segment at offset 25037202 in 0 ms. (kafka.log.Log)
[2019-11-06 03:24:57,452] INFO [Partition Fe-40 broker=23] Shrinking ISR from 
16,17,23 to 23. Leader: (highWatermark: 243730381, endOffset: 243730650). Out 
of sync replicas: (brokerId: 16, endOffset: 243730381) (brokerId: 17, 
endOffset: 243730382). (kafka.cluster.Partition)
[2019-11-06 03:24:57,525] INFO [Partition Fe-40 broker=23] Cached zkVersion 
[27] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-11-06 03:24:57,526] INFO [Partition ac-450 broker=23] Shrinking ISR from 
4,5,23 to 23. Leader: (highWatermark: 187730758, endOffset: 187731018). Out of 
sync replicas: (brokerId: 4, endOffset: 187730773) (brokerId: 5, endOffset: 
187730758). (kafka.cluster.Partition)
[2019-11-06 03:24:57,528] INFO [Partition ac-450 broker=23] Cached zkVersion 
[27] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-11-06 03:24:57,529] INFO [Partition fb-46 broker=23] Shrinking ISR from 
12,13,23 to 23. Leader: (highWatermark: 3543421, endOffset: 3543424). Out of 
sync replicas: (brokerId: 12, endOffset: 3543421) (brokerId: 13, endOffset: 
3543421). (kafka.cluster.Partition)
[2019-11-06 03:24:57,531] INFO [Partition fb-46 broker=23] Cached zkVersion 
[27] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-11-06 03:24:57,532] INFO [Partition ac-120 broker=23] Shrinking ISR from 
23,29,30 to 23. Leader: (highWatermark: 187738210, endOffset: 187738437). Out 
of sync replicas: (brokerId: 29, endOffset: 187738212) (brokerId: 30, 
endOffset: 187738210). (kafka.cluster.Partition)
[2019-11-06 03:24:57,535] INFO [Partition ac-120 broker=23] Cached zkVersion 
[25] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-11-06 03:24:57,535] INFO [Partition se-37 broker=23] Shrinking ISR from 
22,23,24 to 23. Leader: (highWatermark: 25042854, endOffset: 25042891). Out of 
sync replicas: (brokerId: 22, endOffset: 25042854) (brokerId: 24, endOffset: 
25042855). (kafka.cluster.Partition)
[2019-11-06 03:24:57,538] INFO [Partition ses-37 broker=23] Cached zkVersion 
[31] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)}}

The state-change log of the affected broker has no entries in this time period, 
and the broker's id is not listed in zookeeper. As you can see, everything looks 
normal and then the broker is kicked out of the cluster. This has occurred on 
16 clusters in separate datacenters since I started tracking it.
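
In case it helps anyone reproduce the check, here is a minimal sketch of one way 
to confirm whether the dropped broker still has an ephemeral registration under 
/brokers/ids. It assumes the kazoo Python client and a reachable ensemble; the 
host string is a placeholder, and broker 23 is the one the controller reported 
as deleted above.

{code:python}
# List the broker ids currently registered in ZooKeeper and flag a missing one.
# ZK_HOSTS is a placeholder ensemble address, not a value from this cluster.
from kazoo.client import KazooClient

ZK_HOSTS = "zk1.example.com:2181"
EXPECTED_BROKER_ID = 23  # the broker the controller marked as deleted

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
try:
    live_ids = sorted(int(i) for i in zk.get_children("/brokers/ids"))
    print("registered broker ids:", live_ids)
    if EXPECTED_BROKER_ID not in live_ids:
        print("broker %d has no ephemeral znode under /brokers/ids" % EXPECTED_BROKER_ID)
finally:
    zk.stop()
{code}

When the issue hits, the dropped broker's id is absent from that list even 
though the broker process itself is still running and logging.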

I'd like to stress that this issue did not occur before I upgraded from Kafka 2.0.1.


> Brokers occasionally (randomly?) dropping out of clusters
> ---------------------------------------------------------
>
>                 Key: KAFKA-9044
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9044
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.3.0, 2.3.1
>         Environment: Ubuntu 14.04
>            Reporter: Peter Bukowinski
>            Priority: Major
>
> I have several clusters running Kafka 2.3.1 and this issue has affected all of 
> them. Because of replication and the size of the clusters (30 brokers), this 
> bug is not causing any data loss, but it is nevertheless concerning. When a 
> broker drops out, the log gives no indication that there are any zookeeper 
> issues (and indeed the zookeepers are healthy when this occurs). Here's a 
> snippet from a broker log when it occurs:
> {{[2019-10-07 11:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 0 
> expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, 
> dir=/data/3/kl] Found deletable segments with base offsets [1975332] due to 
> retention time 3600000ms breach (kafka.log.Log)}}
>  {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, 
> dir=/data/3/kl] Scheduling log segment [baseOffset 1975332, size 92076008] 
> for deletion. (kafka.log.Log)}}
>  {{[2019-10-07 11:02:56,936] INFO [Log partition=internal_test-52, 
> dir=/data/3/kl] Incrementing log start offset to 2000317 (kafka.log.Log)}}
>  {{[2019-10-07 11:03:56,936] INFO [Log partition=internal_test-52, 
> dir=/data/3/kl] Deleting segment 1975332 (kafka.log.Log)}}
>  {{[2019-10-07 11:03:56,957] INFO Deleted log 
> /data/3/kl/internal_test-52/00000000000001975332.log.deleted. 
> (kafka.log.LogSegment)}}
>  {{[2019-10-07 11:03:56,957] INFO Deleted offset index 
> /data/3/kl/internal_test-52/00000000000001975332.index.deleted. 
> (kafka.log.LogSegment)}}
>  {{[2019-10-07 11:03:56,958] INFO Deleted time index 
> /data/3/kl/internal_test-52/00000000000001975332.timeindex.deleted. 
> (kafka.log.LogSegment)}}
>  {{[2019-10-07 11:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 11:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 11:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 11:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 11:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 12:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 12:12:27,630] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 12:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 12:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 12:42:27,630] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 1 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 12:52:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 13:02:27,630] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 13:12:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 13:22:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 13:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 13:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 13:52:27,630] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 1 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 14:02:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 14:12:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 14:22:27,630] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 1 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 14:32:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 14:42:27,629] INFO [GroupMetadataManager brokerId=14] Removed 
> 0 expired offsets in 0 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)}}
>  {{[2019-10-07 14:46:07,510] INFO [Partition internal_test-33 broker=14] 
> Shrinking ISR from 16,17,14 to 14. Leader: (highWatermark: 2007553, 
> endOffset: 2007555). Out of sync replicas: (brokerId: 16, endOffset: 2007553) 
> (brokerId: 17, endOffset: 2007553). (kafka.cluster.Partition)}}
>  {{[2019-10-07 14:46:07,511] INFO [Partition internal_test-33 broker=14] 
> Cached zkVersion [17] not equal to that in zookeeper, skip updating ISR 
> (kafka.cluster.Partition)}}
> The controller log shows the following:
> {{[2019-10-07 14:45:55,427] INFO [Controller id=24] Newly added brokers: , 
> deleted brokers: 14, bounced brokers: , all live brokers: 
> 1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25 
> (kafka.controller.KafkaController)}}
> {{[2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Shutting 
> down (kafka.controller.RequestSendThread)}}
> {{[2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Shutdown 
> completed (kafka.controller.RequestSendThread)}}
> {{[2019-10-07 14:45:55,477] INFO [RequestSendThread controllerId=24] Stopped 
> (kafka.controller.RequestSendThread)}}
> {{[2019-10-07 14:45:55,481] INFO [Controller id=24] Broker failure callback 
> for 14 (kafka.controller.KafkaController)}}
> The clusters use {{zookeeper.session.timeout.ms=45000}}, and 
> {{zookeeper.connection.timeout.ms=90000}}.
> I'm unable to find a cause for this behavior. The only thing I can do to 
> resolve the issue is to restart the broker.



