mlowicki created KAFKA-16371: -------------------------------- Summary: Unstable committed offsets after triggering commits where some metadata for some partitions are over the limit Key: KAFKA-16371 URL: https://issues.apache.org/jira/browse/KAFKA-16371 Project: Kafka Issue Type: Bug Components: offset manager Affects Versions: 3.7.0 Reporter: mlowicki
Issue is reproducible with simple CLI tool - [https://gist.github.com/mlowicki/c3b942f5545faced93dc414e01a2da70] {code:java} #!/usr/bin/env bash for i in {1..100} do kafka-committer --bootstrap "ADDR:9092" --topic "TOPIC" --group bar --metadata-min 6000 --metadata-max 10000 --partitions 72 --fetch done{code} What it does it that initially it fetches committed offsets and then tries to commit for multiple partitions. If some of commits have metadata over the allowed limit then: 1. I see errors about too large commits (expected) 2. Another run the tool fails at the stage of fetching commits with (this is the problem): {code:java} config: ClientConfig { conf_map: { "group.id": "bar", "bootstrap.servers": "ADDR:9092", }, log_level: Error, } fetching committed offsets.. Error: Meta data fetch error: OperationTimedOut (Local: Timed out) Caused by: OperationTimedOut (Local: Timed out){code} On the Kafka side I see _unstable_offset_commits_ errors reported by metric {code:java} kafka.net.error_rate{code} Increasing the timeout doesn't help and the only solution I've found is to trigger commits for all partitions with metadata below the limit or to use: {code:java} isolation.level=read_uncommitted{code} I don't know that code very but [https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L495-L500] seems fishy as it's using _offsetMetadata_ and not _filteredOffsetMetadata_ and I see that while removing those pending commits we use filtered offset metadata around [https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L400-L425 |https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L400-L425]so the problem might be related to not cleaning up the data structure with pending commits properly. -- This message was sent by Atlassian Jira (v8.20.10#820010)