[jira] [Created] (KAFKA-16371) Unstable committed offsets after triggering commits where some metadata for some partitions are over the limit

mlowicki (Jira) Thu, 14 Mar 2024 03:03:34 -0700

mlowicki created KAFKA-16371:
--------------------------------

             Summary: Unstable committed offsets after triggering commits where 
some metadata for some partitions are over the limit
                 Key: KAFKA-16371
                 URL: https://issues.apache.org/jira/browse/KAFKA-16371
             Project: Kafka
          Issue Type: Bug
          Components: offset manager
    Affects Versions: 3.7.0
            Reporter: mlowicki



Issue is reproducible with simple CLI tool - 
[https://gist.github.com/mlowicki/c3b942f5545faced93dc414e01a2da70]

 

 
{code:java}
#!/usr/bin/env bash

for i in {1..100}
do
    kafka-committer --bootstrap "ADDR:9092" --topic "TOPIC" --group bar 
--metadata-min 6000 --metadata-max 10000 --partitions 72 --fetch
done{code}
 

 

What it does it that initially it fetches committed offsets and then tries to 
commit for multiple partitions. If some of commits have metadata over the 
allowed limit then:
1. I see errors about too large commits (expected)

2. Another run the tool fails at the stage of fetching commits with (this is 
the problem):

 

 
{code:java}
config: ClientConfig { conf_map: { "group.id": "bar", "bootstrap.servers": 
"ADDR:9092", }, log_level: Error, }
fetching committed offsets..
Error: Meta data fetch error: OperationTimedOut (Local: Timed out) Caused by: 
OperationTimedOut (Local: Timed out){code}
 


On the Kafka side I see _unstable_offset_commits_ errors reported by metric 
{code:java}
kafka.net.error_rate{code}

Increasing the timeout doesn't help and the only solution I've found is to 
trigger commits for all partitions with metadata below the limit or to use: 
{code:java}
isolation.level=read_uncommitted{code}
 

 

I don't know that code very but 
[https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L495-L500]
 seems fishy as it's using _offsetMetadata_ and not _filteredOffsetMetadata_ 
and I see that while removing those pending commits we use filtered offset 
metadata around 
[https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L400-L425
 
|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L400-L425]so
 the problem might be related to not cleaning up the data structure with 
pending commits properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (KAFKA-16371) Unstable committed offsets after triggering commits where some metadata for some partitions are over the limit

Reply via email to