[ https://issues.apache.org/jira/browse/KAFKA-13077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456515#comment-17456515 ]

Shivakumar commented on KAFKA-13077:
------------------------------------

Hi [~junrao],

Here is a summary of our issue; we hope you can help us with it.

Kafka (2.8.1) and ZooKeeper (3.6.3) on EKS (Kubernetes 1.19)
Kafka cluster size = 3
ZK cluster size = 3

1) After a rolling restart of ZK, all partitions of a topic sometimes become out of sync, especially around broker 2: the leader is broker 2, the ISR contains only broker 2, and the other brokers have dropped out of the ISR (a command to list only the under-replicated partitions is sketched after the describe output below):
Topic: __consumer_offsets    PartitionCount: 50    ReplicationFactor: 3    Configs: compression.type=producer,cleanup.policy=compact,segment.bytes=104857600
    Topic: __consumer_offsets    Partition: 0    Leader: 2    Replicas: 0,1,2    Isr: 2
    Topic: __consumer_offsets    Partition: 1    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: __consumer_offsets    Partition: 2    Leader: 2    Replicas: 2,0,1    Isr: 2,1,0
    Topic: __consumer_offsets    Partition: 3    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: __consumer_offsets    Partition: 4    Leader: 2    Replicas: 1,0,2    Isr: 2
    Topic: __consumer_offsets    Partition: 5    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: __consumer_offsets    Partition: 6    Leader: 2    Replicas: 0,1,2    Isr: 2
    Topic: __consumer_offsets    Partition: 7    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: __consumer_offsets    Partition: 8    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: __consumer_offsets    Partition: 9    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: __consumer_offsets    Partition: 10    Leader: 2    Replicas: 1,0,2    Isr: 2
    Topic: __consumer_offsets    Partition: 11    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: __consumer_offsets    Partition: 12    Leader: 2    Replicas: 0,1,2    Isr: 2
    Topic: __consumer_offsets    Partition: 13    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: __consumer_offsets    Partition: 14    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: __consumer_offsets    Partition: 15    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: __consumer_offsets    Partition: 16    Leader: 2    Replicas: 1,0,2    Isr: 2
    Topic: __consumer_offsets    Partition: 17    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: __consumer_offsets    Partition: 18    Leader: 2    Replicas: 0,1,2    Isr: 2
    Topic: __consumer_offsets    Partition: 19    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: __consumer_offsets    Partition: 20    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: __consumer_offsets    Partition: 21    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: __consumer_offsets    Partition: 22    Leader: 2    Replicas: 1,0,2    Isr: 2
    Topic: __consumer_offsets    Partition: 23    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: __consumer_offsets    Partition: 24    Leader: 2    Replicas: 0,1,2    Isr: 2
    Topic: __consumer_offsets    Partition: 25    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: __consumer_offsets    Partition: 26    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: __consumer_offsets    Partition: 27    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: __consumer_offsets    Partition: 28    Leader: 2    Replicas: 1,0,2    Isr: 2
    Topic: __consumer_offsets    Partition: 29    Leader: 2    Replicas: 2,1,0    Isr: 2,1,0
    Topic: __consumer_offsets    Partition: 30    Leader: 2    Replicas: 0,1,2    Isr: 2
    Topic: __consumer_offsets    Partition: 31    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: __consumer_offsets    Partition: 32    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: __consumer_offsets    Partition: 33    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: __consumer_offsets    Partition: 34    Leader: 2    Replicas: 1,0,2    Isr: 2
    Topic: __consumer_offsets    Partition: 35    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: __consumer_offsets    Partition: 36    Leader: 2    Replicas: 0,1,2    Isr: 2
    Topic: __consumer_offsets    Partition: 37    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: __consumer_offsets    Partition: 38    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: __consumer_offsets    Partition: 39    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: __consumer_offsets    Partition: 40    Leader: 2    Replicas: 1,0,2    Isr: 2
    Topic: __consumer_offsets    Partition: 41    Leader: 2    Replicas: 2,1,0    Isr: 2,1,0
    Topic: __consumer_offsets    Partition: 42    Leader: 2    Replicas: 0,1,2    Isr: 2
    Topic: __consumer_offsets    Partition: 43    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: __consumer_offsets    Partition: 44    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: __consumer_offsets    Partition: 45    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: __consumer_offsets    Partition: 46    Leader: 2    Replicas: 1,0,2    Isr: 2
    Topic: __consumer_offsets    Partition: 47    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: __consumer_offsets    Partition: 48    Leader: 2    Replicas: 0,1,2    Isr: 2
    Topic: __consumer_offsets    Partition: 49    Leader: 2    Replicas: 1,2,0    Isr: 2
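
For reference, a quicker way to surface only the affected partitions is the describe filter below; this is just a sketch, and the bootstrap address and SSL client config file are placeholders for our setup:

  # list only partitions whose ISR is smaller than the replica set
  # (broker address and client-ssl.properties are placeholders)
  kafka-topics.sh --bootstrap-server <broker>:9092 \
    --command-config client-ssl.properties \
    --describe --under-replicated-partitions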

2) The __consumer_offsets and __transaction_state topics also get stuck in the condition shown above.

3) Running kafka-run-class with DumpLogSegments showed an offset mismatch in the index files on brokers 0 and 1, which are not in sync with broker 2 (the leader, broker 2, showed no mismatch errors). A sketch for checking all index files in a partition directory follows the output below.

kafka [ /var/lib/kafka/data/__consumer_offsets-2 ]$ kafka-run-class.sh kafka.tools.DumpLogSegments --files 00000000000014022515.index
Dumping 00000000000014022515.index
offset: 14022515 position: 0
Mismatches in :/var/lib/kafka/data/__consumer_offsets-2/00000000000014022515.index
  Index offset: 14022515, log offset: 14022523
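
A minimal sketch of checking every offset index in a partition directory in one go, assuming the --index-sanity-check option of DumpLogSegments is available in this version (the partition path is a placeholder):

  # validate every offset index in one partition directory; the path is a placeholder
  cd /var/lib/kafka/data/__consumer_offsets-2
  for f in *.index; do
    # --index-sanity-check validates the index instead of dumping every entry
    kafka-run-class.sh kafka.tools.DumpLogSegments --index-sanity-check --files "$f"
  done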

4) server.properties (a command to confirm the broker's effective runtime configuration is sketched after the list):
broker.id: 0
zookeeper.connect: ops-zk-client:2181/KafkaCluster1
zookeeper.session.timeout.ms: 6000
zookeeper.set.acl: false
broker.rack: ap-southeast-1c
inter.broker.protocol.version: 2.8
background.threads: 10
compression.type: producer
broker.id.generation.enable: false
reserved.broker.max.id: 1000
controlled.shutdown.enable: true
controlled.shutdown.max.retries: 3
controlled.shutdown.retry.backoff.ms: 5000
controller.socket.timeout.ms: 30000
auto.create.topics.enable: true
delete.topic.enable: true
log.dirs: /var/lib/kafka/data
log.message.format.version: 1.1-IV0
log.retention.bytes: -1
log.retention.minutes: 120
log.retention.check.interval.ms: 300000
log.flush.interval.messages: 9223372036854775807
log.flush.offset.checkpoint.interval.ms: 60000
log.flush.scheduler.interval.ms: 9223372036854775807
log.segment.bytes: 1073741824
log.segment.delete.delay.ms: 60000
log.roll.hours: 24
log.roll.jitter.hours: 0
log.cleaner.backoff.ms: 15000
log.cleaner.dedupe.buffer.size: 134217728
log.cleaner.delete.retention.ms: 86400000
log.cleaner.enable: true
log.cleaner.io.buffer.load.factor: 0.9
log.cleaner.io.buffer.size: 524288
log.cleaner.io.max.bytes.per.second: 1.7976931348623157E308
log.cleaner.min.cleanable.ratio: 0.5
log.cleaner.min.compaction.lag.ms: 0
log.cleaner.threads: 1
log.cleanup.policy: delete
log.index.interval.bytes: 4096
log.index.size.max.bytes: 10485760
log.message.timestamp.difference.max.ms: 9223372036854775807
log.message.timestamp.type: CreateTime
log.preallocate: false
listeners: SSL://:9092
auto.leader.rebalance.enable: true
unclean.leader.election.enable: false
leader.imbalance.check.interval.seconds: 300
leader.imbalance.per.broker.percentage: 10
default.replication.factor: 3
num.partitions: 3
min.insync.replicas: 3
offset.metadata.max.bytes: 4096
offsets.commit.required.acks: -1
offsets.commit.timeout.ms: 5000
offsets.load.buffer.size: 5242880
offsets.retention.check.interval.ms: 600000
offsets.retention.minutes: 2880
offsets.topic.compression.codec: 0
offsets.topic.num.partitions: 50
offsets.topic.replication.factor: 3
offsets.topic.segment.bytes: 104857600
quota.consumer.default: 9223372036854775807
consumer.byte.rate: 9223372036854775807
quota.producer.default: 9223372036854775807
producer.byte.rate: 9223372036854775807
replica.fetch.min.bytes: 1
replica.fetch.wait.max.ms: 500
replica.high.watermark.checkpoint.interval.ms: 5000
replica.lag.time.max.ms: 10000
replica.socket.receive.buffer.bytes: 65536
replica.socket.timeout.ms: 30000
replica.fetch.max.bytes: 2097152
replica.fetch.response.max.bytes: 10485760
replica.fetch.backoff.ms: 1000
num.io.threads: 8
num.network.threads: 3
num.recovery.threads.per.data.dir: 1
num.replica.fetchers: 1
message.max.bytes: 2097152
queued.max.requests: 500
request.timeout.ms: 30000
socket.receive.buffer.bytes: 102400
socket.request.max.bytes: 104857600
socket.send.buffer.bytes: 102400
connections.max.idle.ms: 600000
fetch.purgatory.purge.interval.requests: 1000
group.initial.rebalance.delay.ms: 3000
group.max.session.timeout.ms: 300000
group.min.session.timeout.ms: 6000
producer.purgatory.purge.interval.requests: 1000
max.connections.per.ip: 2147483647
security.inter.broker.protocol: SSL
ssl.mount.path: /opt/kafka/ssl
ssl.secure.random.implementation: SHA1PRNG
ssl.keystore.location: /opt/kafka/ssl/server.keystore.jks
ssl.keystore.password: *****
ssl.key.password: *****
ssl.truststore.location: /opt/kafka/ssl/server.truststore.jks
ssl.truststore.password: *****
ssl.endpoint.identification.algorithm:
ssl.client.auth: required
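
For completeness, a sketch of how the configuration the running broker actually applied could be confirmed against the file above (broker id 0 and the SSL client config file are placeholders; we assume the --all describe option is available on 2.8):

  # show every config value the running broker has applied, including defaults and overrides
  kafka-configs.sh --bootstrap-server <broker>:9092 \
    --command-config client-ssl.properties \
    --entity-type brokers --entity-name 0 \
    --describe --all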

5) We increased num.replica.fetchers to 3, but this did not help (one way to apply that change is sketched below).
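
This is only a sketch of one way that setting can be raised without a broker restart, assuming the dynamic broker config API is used (broker address and SSL client config file are placeholders):

  # raise the fetcher count as a cluster-wide dynamic default
  kafka-configs.sh --bootstrap-server <broker>:9092 \
    --command-config client-ssl.properties \
    --entity-type brokers --entity-default \
    --alter --add-config num.replica.fetchers=3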

6) Replication is stuck and the non-leader brokers never rejoin the ISR. A consumer group describe also fails and times out with the following error (the shape of the command, with a longer timeout, is sketched after the error):

  Error: Executing consumer group command failed due to org.apache.kafka.common.errors.TimeoutException: Call(callName=listOffsets on broker 3, deadlineMs=1638858631539, tries=1, nextAllowedTryMs=1638858631643) timed out at 1638858631543 after 1 attempt(s)
  java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=listOffsets on broker 3, deadlineMs=1638858631539, tries=1, nextAllowedTryMs=1638858631643) timed out at 1638858631543 after 1 attempt(s)
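
For reference, this is the shape of the describe command that hits the timeout; the group name, broker address, and SSL client config are placeholders, and --timeout raises the default 5000 ms admin timeout:

  # describe a consumer group with a longer admin timeout; names and paths are placeholders
  kafka-consumer-groups.sh --bootstrap-server <broker>:9092 \
    --command-config client-ssl.properties \
    --describe --group <group-name> \
    --timeout 60000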

7) *The only solution left for us was to delete the PVC (the Kafka data directory), which involved data loss.*


*Common types of errors during the incident:*
 # 2021-11-25 12:14:50,367 [kafka-broker] - INFO  
[ReplicaFetcherThread-0-2:Logging@66] - [ReplicaFetcher replicaId=0, 
leaderId=2, fetcherId=0] Current offset 0 for partition 
lint-archival-config-stream-0 is out of range, which typically implies a leader 
change. Reset fetch offset to 119
 # 2021-11-25 12:13:40,757 [kafka-broker] - ERROR 
[data-plane-kafka-request-handler-0:Logging@76] - [ReplicaManager broker=2] 
Error processing append operation on partition __consumer_offsets-29
 # org.apache.kafka.common.errors.NotEnoughReplicasException: The size of the 
current ISR Set(2) is insufficient to satisfy the min.isr requirement of 2 for 
partition __consumer_offsets-29
 # 2021-11-25 12:19:02,082 [kafka-broker] - WARN  
[ReplicaFetcherThread-0-0:Logging@70] - [Log 
partition=ingestion-pipeline-stream-5, dir=/var/lib/kafka/data] Non-monotonic 
update of high watermark from (offset=23110 segment=[-1:-1]) to (offset=23109 
segment=[-1:-1])
 # 2021-11-25 12:24:26,603 [kafka-broker] - INFO  
[ReplicaFetcherThread-0-2:Logging@66] - [Log 
partition=wavefront-data-archival-stream-2, dir=/var/lib/kafka/data] Truncating 
to 0 has no effect as the largest offset in the log is -1
 # org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 0.
At least 2 brokers were up the whole time.
 # 2021-11-25 12:14:41,299 [kafka-broker] - INFO  
[data-plane-kafka-request-handler-6:Logging@68] - [Admin Manager on Broker 0]: 
Error processing create topic request 
CreatableTopic(name='alert-streams-app-alert-aggr-window-time-store-changelog', 
numPartitions=3, replicationFactor=3, assignments=[], configs=[])
 # 2021-11-25 12:13:23,838 [kafka-broker] - INFO  
[ReplicaFetcherThread-0-2:Logging@66] - [ReplicaFetcher replicaId=1, 
leaderId=2, fetcherId=0] Partition 
alert-streams-app-30min-alert-match-aggregator-store-changelog-32 has an older 
epoch (27) than the current leader. Will await the new LeaderAndIsr state 
before resuming fetching.
 # 2021-11-25 12:24:26,557 [kafka-broker] - INFO  
[data-plane-kafka-request-handler-2:Logging@66] - [Partition 
wavefront-data-archival-stream-7 broker=1] No checkpointed highwatermark is 
found for partition wavefront-data-archival-stream-7
 # 2021-11-25 11:35:43,171 [kafka-broker] - WARN  
[broker-0-to-controller-send-thread:Logging@70] - Broker had a stale broker 
epoch (90194346998), retrying.
 # 2021-11-25 12:14:42,710 [kafka-broker] - WARN  
[ReplicaFetcherThread-0-0:Logging@72] - [ReplicaFetcher replicaId=1, 
leaderId=0, fetcherId=0] Error in response for fetch request

> Replication failing after unclean shutdown of ZK and all brokers
> ----------------------------------------------------------------
>
>                 Key: KAFKA-13077
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13077
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.8.0
>            Reporter: Christopher Auston
>            Priority: Minor
>
> I am submitting this in the spirit of what can go wrong when an operator 
> violates the constraints Kafka depends on. I don't know if Kafka could or 
> should handle this more gracefully. I decided to file this issue because it 
> was easy to get the problem I'm reporting with Kubernetes StatefulSets (STS). 
> By "easy" I mean that I did not go out of my way to corrupt anything, I just 
> was not careful when restarting ZK and brokers.
> I violated the constraints of keeping Zookeeper stable and at least one 
> running in-sync replica. 
> I am running the bitnami/kafka helm chart on Amazon EKS.
> {quote}% kubectl get po kaf-kafka-0 -ojson |jq .spec.containers'[].image'
> "docker.io/bitnami/kafka:2.8.0-debian-10-r43"
> {quote}
> I started with 3 ZK instances and 3 brokers (both STS). I changed the 
> cpu/memory requests on both STS and kubernetes proceeded to restart ZK and 
> kafka instances at the same time. If I recall correctly there were some 
> crashes and several restarts but eventually all the instances were running 
> again. It's possible all ZK nodes and all brokers were unavailable at various 
> points.
> The problem I noticed was that two of the brokers were just continually 
> spitting out messages like:
> {quote}% kubectl logs kaf-kafka-0 --tail 10
> [2021-07-13 14:26:08,871] INFO [ProducerStateManager 
> partition=__transaction_state-0] Loading producer state from snapshot file 
> 'SnapshotFile(/bitnami/kafka/data/__transaction_state-0/00000000000000000001.snapshot,1)'
>  (kafka.log.ProducerStateManager)
> [2021-07-13 14:26:08,871] WARN [Log partition=__transaction_state-0, 
> dir=/bitnami/kafka/data] *Non-monotonic update of high watermark from 
> (offset=2744 segment=[0:1048644]) to (offset=1 segment=[0:169])* 
> (kafka.log.Log)
> [2021-07-13 14:26:08,874] INFO [Log partition=__transaction_state-10, 
> dir=/bitnami/kafka/data] Truncating to offset 2 (kafka.log.Log)
> [2021-07-13 14:26:08,877] INFO [Log partition=__transaction_state-10, 
> dir=/bitnami/kafka/data] Loading producer state till offset 2 with message 
> format version 2 (kafka.log.Log)
> [2021-07-13 14:26:08,877] INFO [ProducerStateManager 
> partition=__transaction_state-10] Loading producer state from snapshot file 
> 'SnapshotFile(/bitnami/kafka/data/__transaction_state-10/00000000000000000002.snapshot,2)'
>  (kafka.log.ProducerStateManager)
> [2021-07-13 14:26:08,877] WARN [Log partition=__transaction_state-10, 
> dir=/bitnami/kafka/data] Non-monotonic update of high watermark from 
> (offset=2930 segment=[0:1048717]) to (offset=2 segment=[0:338]) 
> (kafka.log.Log)
> [2021-07-13 14:26:08,880] INFO [Log partition=__transaction_state-20, 
> dir=/bitnami/kafka/data] Truncating to offset 1 (kafka.log.Log)
> [2021-07-13 14:26:08,882] INFO [Log partition=__transaction_state-20, 
> dir=/bitnami/kafka/data] Loading producer state till offset 1 with message 
> format version 2 (kafka.log.Log)
> [2021-07-13 14:26:08,882] INFO [ProducerStateManager 
> partition=__transaction_state-20] Loading producer state from snapshot file 
> 'SnapshotFile(/bitnami/kafka/data/__transaction_state-20/00000000000000000001.snapshot,1)'
>  (kafka.log.ProducerStateManager)
> [2021-07-13 14:26:08,883] WARN [Log partition=__transaction_state-20, 
> dir=/bitnami/kafka/data] Non-monotonic update of high watermark from 
> (offset=2956 segment=[0:1048608]) to (offset=1 segment=[0:169]) 
> (kafka.log.Log)
> {quote}
> If I describe that topic I can see that several partitions have a leader of 2 
> and the ISR is just 2 (NOTE I added two more brokers and tried to reassign 
> the topic onto brokers 2,3,4 which you can see below). The new brokers also 
> spit out the messages about "non-monotonic update" just like the original 
> followers. This describe output is from the following day.
> {{% kafka-topics.sh ${=BS} -topic __transaction_state -describe}}
> {{Topic: __transaction_state TopicId: i7bBNCeuQMWl-ZMpzrnMAw PartitionCount: 
> 50 ReplicationFactor: 3 Configs: 
> compression.type=uncompressed,min.insync.replicas=3,cleanup.policy=compact,flush.ms=1000,segment.bytes=104857600,flush.messages=10000,max.message.bytes=1000012,unclean.leader.election.enable=false,retention.bytes=1073741824}}
> {{ Topic: __transaction_state Partition: 0 Leader: 2 Replicas: 4,3,2,1,0 Isr: 
> 2 Adding Replicas: 4,3 Removing Replicas: 1,0}}
> {{ Topic: __transaction_state Partition: 1 Leader: 2 Replicas: 2,4,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 2 Leader: 3 Replicas: 3,2,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 3 Leader: 4 Replicas: 4,2,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 4 Leader: 2 Replicas: 2,3,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 5 Leader: 2 Replicas: 3,4,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 6 Leader: 4 Replicas: 4,3,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 7 Leader: 2 Replicas: 2,4,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 8 Leader: 2 Replicas: 3,2,4,0,1 Isr: 
> 2 Adding Replicas: 3,4 Removing Replicas: 0,1}}
> {{ Topic: __transaction_state Partition: 9 Leader: 2 Replicas: 4,2,3,1,0 Isr: 
> 2 Adding Replicas: 4,3 Removing Replicas: 1,0}}
> {{ Topic: __transaction_state Partition: 10 Leader: 2 Replicas: 2,3,4,1,0 
> Isr: 2 Adding Replicas: 3,4 Removing Replicas: 1,0}}
> {{ Topic: __transaction_state Partition: 11 Leader: 3 Replicas: 3,4,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 12 Leader: 4 Replicas: 4,3,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 13 Leader: 2 Replicas: 2,4,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 14 Leader: 3 Replicas: 3,2,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 15 Leader: 4 Replicas: 4,2,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 16 Leader: 2 Replicas: 2,3,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 17 Leader: 2 Replicas: 3,4,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 18 Leader: 4 Replicas: 4,3,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 19 Leader: 2 Replicas: 2,4,3,0,1 
> Isr: 2 Adding Replicas: 4,3 Removing Replicas: 0,1}}
> {{ Topic: __transaction_state Partition: 20 Leader: 2 Replicas: 3,2,4,0,1 
> Isr: 2 Adding Replicas: 3,4 Removing Replicas: 0,1}}
> {{ Topic: __transaction_state Partition: 21 Leader: 2 Replicas: 4,2,3,1,0 
> Isr: 2 Adding Replicas: 4,3 Removing Replicas: 1,0}}
> {{ Topic: __transaction_state Partition: 22 Leader: 2 Replicas: 2,3,4,1,0 
> Isr: 2 Adding Replicas: 3,4 Removing Replicas: 1,0}}
> {{ Topic: __transaction_state Partition: 23 Leader: 3 Replicas: 3,4,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 24 Leader: 4 Replicas: 4,3,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 25 Leader: 2 Replicas: 2,4,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 26 Leader: 3 Replicas: 3,2,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 27 Leader: 4 Replicas: 4,2,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 28 Leader: 2 Replicas: 2,3,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 29 Leader: 3 Replicas: 3,4,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 30 Leader: 4 Replicas: 4,3,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 31 Leader: 2 Replicas: 2,4,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 32 Leader: 3 Replicas: 3,2,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 33 Leader: 4 Replicas: 4,2,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 34 Leader: 2 Replicas: 2,3,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 35 Leader: 3 Replicas: 3,4,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 36 Leader: 4 Replicas: 4,3,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 37 Leader: 2 Replicas: 2,4,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 38 Leader: 3 Replicas: 3,2,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 39 Leader: 4 Replicas: 4,2,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 40 Leader: 2 Replicas: 2,3,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 41 Leader: 3 Replicas: 3,4,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 42 Leader: 4 Replicas: 4,3,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 43 Leader: 2 Replicas: 2,4,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 44 Leader: 3 Replicas: 3,2,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 45 Leader: 4 Replicas: 4,2,3 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 46 Leader: 2 Replicas: 2,3,4 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 47 Leader: 3 Replicas: 3,4,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 48 Leader: 4 Replicas: 4,3,2 Isr: 
> 2,3,4}}
> {{ Topic: __transaction_state Partition: 49 Leader: 2 Replicas: 2,4,3 Isr: 
> 2,3,4}}
>  
> It seems something got corrupted and the followers will never make progress.
> Even worse, the original followers appear to have truncated their copies, so
> if the remaining leader replica is the one that is corrupted, then it may have
> truncated replicas that had more valid data?
> Anyway, for what it's worth, this is something that happened to me. I plan to
> change the statefulsets to require manual restarts so I can control rolling
> upgrades. It also seems to underscore the value of having a separate Kafka
> cluster for disaster recovery.


