[ https://issues.apache.org/jira/browse/KAFKA-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Colin Leroy updated KAFKA-16296:
--------------------------------
    Description:
We have a rolling-restart problem that we don't understand on a 3-node cluster. When a broker is stopped, everything goes fine and the partitions are reassigned to the other brokers. When that broker restarts, it shrinks the ISR because of "Out of sync replicas":

{code:java}
[2024-02-22 10:18:02,069] INFO [Partition OSS.PREPROD.Monitoring.Metric-5 broker=3] Shrinking ISR from 2,1,3 to 3. Leader: (highWatermark: 704389542, endOffset: 704395843). Out of sync replicas: (brokerId: 2, endOffset: -1, lastCaughtUpTimeMs: 1708593437335) (brokerId: 1, endOffset: -1, lastCaughtUpTimeMs: 1708593437335). (kafka.cluster.Partition)
[2024-02-22 10:18:02,124] INFO [Partition OSS.PREPROD.Monitoring.Metric-5 broker=3] ISR updated to 3 (under-min-isr) and version updated to 1075 (kafka.cluster.Partition)
{code}

I do not understand why brokers 1 and 2 would be out of sync. Given that brokers 1 and 2 were not restarted, it seems to me that they should be in sync.

This, of course, causes problems, as producers reconnect to broker 3 only to find that the min ISR requirement is not fulfilled.

I have attached the logs for one of the affected partitions, both from broker 3 (the restarted one) and broker 2 (not restarted).

Thanks in advance,
Colin

> Broker shrinks ISR when restarting
> ----------------------------------
>
>                 Key: KAFKA-16296
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16296
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.6.1
>            Reporter: Colin Leroy
>            Priority: Major
>         Attachments: broker2.log, broker3.log

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
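
For readers of the log excerpt above, here is a minimal sketch of the lag arithmetic behind a "Shrinking ISR" message. It assumes the leader treats a follower as out of sync when the follower's end offset differs from the leader's and the follower has not caught up within `replica.lag.time.max.ms` (default 30000 ms). It also assumes the broker logs in UTC+1, which the report does not state. The helper names are hypothetical, and the sketch only illustrates the comparison; it does not explain why `lastCaughtUpTimeMs` was stale.

```python
from datetime import datetime, timedelta, timezone

# Kafka's default for replica.lag.time.max.ms; the cluster's real value is not in the report.
REPLICA_LAG_TIME_MAX_MS = 30_000

# Values copied from the log line in the report.
LOG_TIMESTAMP = "2024-02-22 10:18:02,069"
LAST_CAUGHT_UP_MS = 1708593437335
LEADER_END_OFFSET = 704395843
FOLLOWER_END_OFFSET = -1


def to_epoch_ms(log_timestamp: str, utc_offset_hours: int) -> int:
    """Convert a broker log timestamp to epoch milliseconds.

    utc_offset_hours is an assumption: the log line carries no time zone.
    """
    local = datetime.strptime(log_timestamp, "%Y-%m-%d %H:%M:%S,%f")
    aware = local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (aware - epoch) // timedelta(milliseconds=1)


def is_out_of_sync(now_ms: int, last_caught_up_ms: int,
                   follower_end_offset: int, leader_end_offset: int,
                   max_lag_ms: int = REPLICA_LAG_TIME_MAX_MS) -> bool:
    """Hypothetical mirror of the leader-side check described above."""
    behind = follower_end_offset != leader_end_offset
    stale = (now_ms - last_caught_up_ms) > max_lag_ms
    return behind and stale


now_ms = to_epoch_ms(LOG_TIMESTAMP, utc_offset_hours=1)  # assumed UTC+1
lag_ms = now_ms - LAST_CAUGHT_UP_MS
out_of_sync = is_out_of_sync(now_ms, LAST_CAUGHT_UP_MS,
                             FOLLOWER_END_OFFSET, LEADER_END_OFFSET)
print(f"lag since lastCaughtUpTimeMs: {lag_ms} ms, out of sync: {out_of_sync}")
```

Under these assumptions, both followers report the same `lastCaughtUpTimeMs` and an end offset of -1, so they are evaluated identically. Changing `utc_offset_hours` shows how sensitive the conclusion is to the time zone guess.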