kaushik srinivas created KAFKA-13177:
----------------------------------------

             Summary: partition failures and fewer ISR shrinks but many ISR
expansions with increased num.replica.fetchers in Kafka brokers
                 Key: KAFKA-13177
                 URL: https://issues.apache.org/jira/browse/KAFKA-13177
             Project: Kafka
          Issue Type: Bug
            Reporter: kaushik srinivas


Installed a 3-node Kafka broker cluster (4-core CPU and 4Gi memory on k8s)

topics: 15, partitions per topic: 15, replication factor: 3, min.insync.replicas: 2

producers running with acks=all
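
For reference, a minimal sketch of the setup described above (topic name, bootstrap
address, and the client code are illustrative assumptions, not the actual deployment):

{code:java}
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class SetupSketch {
    public static void main(String[] args) throws Exception {
        String bootstrap = "kafka-0:9092"; // hypothetical bootstrap address

        // One of the 15 topics: 15 partitions, replication factor 3, min.insync.replicas 2
        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        try (AdminClient admin = AdminClient.create(adminProps)) {
            NewTopic topic = new NewTopic("test", 15, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }

        // Producer with acks=all: each write is acknowledged only after all
        // in-sync replicas (at least min.insync.replicas of them) have it
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        producerProps.put(ProducerConfig.ACKS_CONFIG, "all");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("test", "key", "value")).get();
        }
    }
}
{code}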

Initially num.replica.fetchers was set to 1 (the default) and we observed very
frequent ISR shrinks and expansions, so the brokers were tuned to a higher
value of 4.
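
The change itself is the broker-level num.replica.fetchers setting. Below is a
minimal sketch of applying it cluster-wide through AdminClient, assuming a broker
version where this setting is dynamically updatable (otherwise it goes into
server.properties with a rolling restart); the bootstrap address is illustrative:

{code:java}
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class FetcherTuningSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-0:9092"); // hypothetical address

        try (AdminClient admin = AdminClient.create(props)) {
            // Empty broker id = cluster-wide default, so all three brokers pick it up
            ConfigResource clusterDefault = new ConfigResource(ConfigResource.Type.BROKER, "");
            AlterConfigOp raiseFetchers = new AlterConfigOp(
                    new ConfigEntry("num.replica.fetchers", "4"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Map.of(clusterDefault, Collections.singletonList(raiseFetchers))).all().get();
        }
    }
}
{code}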

Once this change was made, we see the behavior and warning messages below in the
broker logs:
 # Over a period of 2 days, there are around 10 shrinks corresponding to 10
partitions, but around 700 ISR expansions corresponding to almost all
partitions in the cluster (approx. 50 to 60 partitions).
 # We see frequent WARN messages about partitions being marked as failed in the
same time span. Below is the trace:
{"type":"log", "host":"wwwwww", "level":"WARN", "neid":"kafka-wwwwww",
"system":"kafka", "time":"2021-08-03T20:09:15.340", "timezone":"UTC",
"log":{"message":"ReplicaFetcherThread-2-1003 -
kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1001,
leaderId=1003, fetcherId=2] Partition test-16 marked as failed"}}

 

We have seen the above behavior continuously since increasing
num.replica.fetchers from 1 to 4. We increased it to improve replication
performance and hence reduce the ISR shrinks.

But we see this strange behavior after the change. What does the above trace
indicate? Is marking partitions as failed just a WARN message that is handled by
Kafka, or is it something to worry about?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
