[ 
https://issues.apache.org/jira/browse/KAFKA-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Leroy updated KAFKA-16296:
--------------------------------
    Description: 
We have a rolling-restart problem we don't understand on a 3-node cluster.

When stopping a broker, everything goes fine and the partitions are reassigned 
to the other brokers.

When that broker restarts, it shrinks ISR because of "Out of sync replicas":
{code:java}
[2024-02-22 10:18:02,069] INFO [Partition OSS.PREPROD.Monitoring.Metric-5 
broker=3] Shrinking ISR from 2,1,3 to 3. Leader: (highWatermark: 704389542, 
endOffset: 704395843). Out of sync replicas: (brokerId: 2, endOffset: -1, 
lastCaughtUpTimeMs: 1708593437335) (brokerId: 1, endOffset: -1, 
lastCaughtUpTimeMs: 1708593437335). (kafka.cluster.Partition)

[2024-02-22 10:18:02,124] INFO [Partition OSS.PREPROD.Monitoring.Metric-5 
broker=3] ISR updated to 3 (under-min-isr) and version updated to 1075 
(kafka.cluster.Partition) {code}
I do not understand why brokers 1 and 2 would be out of sync. Since neither of 
them was restarted, I would expect both to still be in sync.

This, of course, causes problems as producers reconnect to broker 3 only to 
find the min ISR requirement is not fulfilled.

I have attached the logs for one of the affected partitions, both from broker 3 
(the restarted one) and broker 2 (not restarted).
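
As a triage aid, here is a minimal sketch (not part of Kafka or its tooling) 
that pulls the numbers out of the "Shrinking ISR" line quoted above. The 
regular expressions and the function name parse_shrink_line are assumptions 
based on this single log line, not a documented log format. It reports how far 
the leader's end offset is ahead of the high watermark and converts 
lastCaughtUpTimeMs to UTC so it can be compared with the log timestamp.

```python
import re
from datetime import datetime, timedelta, timezone

# Log line copied from the report above, joined onto a single line.
LINE = (
    "[2024-02-22 10:18:02,069] INFO [Partition OSS.PREPROD.Monitoring.Metric-5 "
    "broker=3] Shrinking ISR from 2,1,3 to 3. Leader: (highWatermark: 704389542, "
    "endOffset: 704395843). Out of sync replicas: (brokerId: 2, endOffset: -1, "
    "lastCaughtUpTimeMs: 1708593437335) (brokerId: 1, endOffset: -1, "
    "lastCaughtUpTimeMs: 1708593437335). (kafka.cluster.Partition)"
)

# Assumed patterns, derived only from the line above.
LEADER_RE = re.compile(r"highWatermark: (\d+), endOffset: (\d+)")
REPLICA_RE = re.compile(
    r"brokerId: (\d+), endOffset: (-?\d+), lastCaughtUpTimeMs: (\d+)"
)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_shrink_line(line):
    """Extract leader offsets and out-of-sync replica details from one line."""
    high_watermark, end_offset = map(int, LEADER_RE.search(line).groups())
    out_of_sync = []
    for broker_id, replica_end, caught_up_ms in REPLICA_RE.findall(line):
        # Integer milliseconds via timedelta avoid float rounding.
        caught_up = EPOCH + timedelta(milliseconds=int(caught_up_ms))
        out_of_sync.append({
            "broker_id": int(broker_id),
            "end_offset": int(replica_end),
            "last_caught_up_utc": caught_up.isoformat(timespec="milliseconds"),
        })
    return {
        "high_watermark": high_watermark,
        "leader_end_offset": end_offset,
        # Offsets written on the leader but not yet covered by the high watermark.
        "uncommitted": end_offset - high_watermark,
        "out_of_sync": out_of_sync,
    }


if __name__ == "__main__":
    info = parse_shrink_line(LINE)
    print("uncommitted offsets on leader:", info["uncommitted"])
    for replica in info["out_of_sync"]:
        print(replica)
```

For the line above this gives a gap of 6301 offsets between the leader's end 
offset and its high watermark, and a lastCaughtUpTimeMs of 
2024-02-22T09:17:17.335 UTC for both followers. How that relates to the 
10:18:02 log timestamp depends on the broker's time zone, which the log line 
does not state.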

Thanks in advance,

Colin

  was:
We have a rolling-restart problem we don't understand on a 3-node cluster.

When stopping a broker, everything goes fine and the partitions are reassigned 
to the other brokers.

When that broker restarts, it shrinks ISR because of "Out of sync replicas":
{code:java}
[2024-02-22 10:18:02,069] INFO [Partition OSS.PREPROD.Monitoring.Metric-5 
broker=3] Shrinking ISR from 2,1,3 to 3. Leader: (highWatermark: 704389542, 
endOffset: 704395843). Out of sync replicas: (brokerId: 2, endOffset: -1, 
lastCaughtUpTimeMs: 1708593437335) (brokerId: 1, endOffset: -1, 
lastCaughtUpTimeMs: 1708593437335). (kafka.cluster.Partition)

[2024-02-22 10:18:02,124] INFO [Partition OSS.PREPROD.Monitoring.Metric-5 
broker=3] ISR updated to 3 (under-min-isr) and version updated to 1075 
(kafka.cluster.Partition) {code}
I do not understand why brokers 1 and 2 would be out of sync, it seems to me 
that given that brokers 1 and 2 were not restarted, they should be in sync.

This, of course, causes problems as producers reconnect to broker 3 only to 
find the min ISR requirement is not fulfilled.

Thanks in advance,

Colin


> Broker shrinks ISR when restarting
> ----------------------------------
>
>                 Key: KAFKA-16296
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16296
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.6.1
>            Reporter: Colin Leroy
>            Priority: Major
>         Attachments: broker2.log, broker3.log
>
>
> We have a rolling-restart problem we don't understand on a 3-node cluster.
> When stopping a broker, everything goes fine and the partitions are 
> reassigned to the other brokers.
> When that broker restarts, it shrinks ISR because of "Out of sync replicas":
> {code:java}
> [2024-02-22 10:18:02,069] INFO [Partition OSS.PREPROD.Monitoring.Metric-5 
> broker=3] Shrinking ISR from 2,1,3 to 3. Leader: (highWatermark: 704389542, 
> endOffset: 704395843). Out of sync replicas: (brokerId: 2, endOffset: -1, 
> lastCaughtUpTimeMs: 1708593437335) (brokerId: 1, endOffset: -1, 
> lastCaughtUpTimeMs: 1708593437335). (kafka.cluster.Partition)
> [2024-02-22 10:18:02,124] INFO [Partition OSS.PREPROD.Monitoring.Metric-5 
> broker=3] ISR updated to 3 (under-min-isr) and version updated to 1075 
> (kafka.cluster.Partition) {code}
> I do not understand why brokers 1 and 2 would be out of sync. Since neither 
> of them was restarted, I would expect both to still be in sync.
> This, of course, causes problems as producers reconnect to broker 3 only to 
> find the min ISR requirement is not fulfilled.
> I have attached the logs for one of the affected partitions, both from broker 
> 3 (the restarted one) and broker 2 (not restarted).
> Thanks in advance,
> Colin



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
