kaushik srinivas created KAFKA-13177:
----------------------------------------
Summary: partition failures and fewer shrinks but many ISR expansions with increased num.replica.fetchers on Kafka brokers
Key: KAFKA-13177
URL: https://issues.apache.org/jira/browse/KAFKA-13177
Project: Kafka
Issue Type: Bug
Reporter: kaushik srinivas

Setup: a 3-node Kafka broker cluster (4 CPU cores and 4Gi memory per broker, on Kubernetes)
topics: 15, partitions per topic: 15
replication factor: 3, min.insync.replicas: 2
producers running with acks: all

Initially num.replica.fetchers was set to 1 (the default) and we observed very frequent ISR shrinks and expansions, so the setups were tuned to a higher value of 4. After this change was made, we see the behavior and warning messages below in the broker logs:

# Over a period of 2 days there are around 10 shrinks corresponding to 10 partitions, but around 700 ISR expansions corresponding to almost all partitions in the cluster (approx. 50 to 60 partitions).
# We see frequent WARN messages of partitions being marked as failed in the same time span. Below is the trace:

{"type":"log", "host":"wwwwww", "level":"WARN", "neid":"kafka-wwwwww", "system":"kafka", "time":"2021-08-03T20:09:15.340", "timezone":"UTC", "log":{"message":"ReplicaFetcherThread-2-1003 - kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1001, leaderId=1003, fetcherId=2] Partition test-16 marked as failed"}}

We see the above behavior continuously after increasing num.replica.fetchers from 1 to 4. We made this change to improve replication performance and thereby reduce the ISR shrinks, but we see this strange behavior after the change.

What does the above trace indicate? Is marking partitions as failed just a WARN message that Kafka handles internally, or is it something to worry about?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
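For reference, the configuration described above can be sketched as the following broker and producer settings. This is a minimal illustration of the values stated in the report, not the reporter's actual config files; file names and any values not mentioned above are assumptions.

```properties
# server.properties (broker) -- sketch of the settings described in this issue
# Replication fetcher threads per source broker: raised from the default of 1 to 4
num.replica.fetchers=4
# Writes need at least 2 in-sync replicas to succeed when acks=all
min.insync.replicas=2
# Topics in this cluster were created with replication factor 3 and 15 partitions,
# e.g. via: kafka-topics.sh --create --topic test --partitions 15 --replication-factor 3

# producer.properties (producer)
# Wait for the full in-sync replica set to acknowledge each write
acks=all
```

With acks=all and min.insync.replicas=2, a partition whose ISR shrinks below 2 replicas will reject produces, which is why frequent ISR shrinks are a concern for these producers.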