[ 
https://issues.apache.org/jira/browse/KAFKA-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mlowicki updated KAFKA-16371:
-----------------------------
    Summary: Unstable committed offsets after triggering commits where metadata 
for some partitions is over the limit  (was: Unstable committed offsets after 
triggering commits where some metadata for some partitions are over the limit)

> Unstable committed offsets after triggering commits where metadata for some 
> partitions is over the limit
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-16371
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16371
>             Project: Kafka
>          Issue Type: Bug
>          Components: offset manager
>    Affects Versions: 3.7.0
>            Reporter: mlowicki
>            Priority: Major
>
> The issue is reproducible with a simple CLI tool - 
> [https://gist.github.com/mlowicki/c3b942f5545faced93dc414e01a2da70]
>  
>  
> {code:java}
> #!/usr/bin/env bash
> for i in {1..100}
> do
>     kafka-committer --bootstrap "ADDR:9092" --topic "TOPIC" --group bar \
>         --metadata-min 6000 --metadata-max 10000 --partitions 72 --fetch
> done{code}
>  
>  
> The tool first fetches the committed offsets and then commits offsets for 
> multiple partitions. If some of the commits carry metadata over the allowed 
> limit, then:
> 1. I see errors about commits being too large (expected).
> 2. Any subsequent run of the tool fails at the stage of fetching the 
> committed offsets with (this is the problem):
>  
>  
> {code:java}
> config: ClientConfig { conf_map: { "group.id": "bar", "bootstrap.servers": 
> "ADDR:9092", }, log_level: Error, }
> fetching committed offsets..
> Error: Meta data fetch error: OperationTimedOut (Local: Timed out) Caused by: 
> OperationTimedOut (Local: Timed out){code}
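> 
> The same sequence can be sketched with the plain Java consumer API (my 
> assumptions: ADDR:9092, TOPIC and group bar mirror the CLI flags above, and 
> the 6000-byte metadata exceeds the broker's default 
> offset.metadata.max.bytes of 4096):
> {code:java}
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.Consumer;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.clients.consumer.OffsetAndMetadata;
> import org.apache.kafka.common.TopicPartition;
> 
> public class Committer {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "ADDR:9092");
>         props.put(ConsumerConfig.GROUP_ID_CONFIG, "bar");
>         props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
>                 "org.apache.kafka.common.serialization.ByteArrayDeserializer");
>         props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
>                 "org.apache.kafka.common.serialization.ByteArrayDeserializer");
> 
>         try (Consumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
>             TopicPartition tp = new TopicPartition("TOPIC", 0);
>             // Step 1: fetch the committed offsets. After a rejected oversized
>             // commit on a previous run, this is the call that times out.
>             System.out.println(consumer.committed(Collections.singleton(tp)));
>             // Step 2: commit ~6 KB of metadata, above the broker's default
>             // offset.metadata.max.bytes (4096); the broker rejects it with
>             // OFFSET_METADATA_TOO_LARGE.
>             String metadata = "m".repeat(6000);
>             consumer.commitSync(
>                     Collections.singletonMap(tp, new OffsetAndMetadata(0L, metadata)));
>         }
>     }
> }
> {code}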
>  
> On the Kafka side I see _unstable_offset_commits_ errors reported by the metric 
> {code:java}
> kafka.net.error_rate{code}
> Increasing the timeout doesn't help; the only solutions I've found are to 
> trigger commits for all partitions with metadata below the limit, or to use: 
> {code:java}
> isolation.level=read_uncommitted{code}
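> 
> A sketch of that workaround with the plain Java consumer (a fragment, using 
> the standard client config keys):
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> 
> Properties props = new Properties();
> props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "ADDR:9092");
> props.put(ConsumerConfig.GROUP_ID_CONFIG, "bar");
> // With read_uncommitted the consumer's offset fetch no longer insists on
> // stable offsets, so it returns instead of timing out on pending commits.
> props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_uncommitted");
> {code}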
>  
>  
> I don't know that code very well, but 
> [https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L495-L500]
> seems fishy, as it uses _offsetMetadata_ and not _filteredOffsetMetadata_, 
> while removing those pending commits uses the filtered offset metadata around 
> [https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L400-L425],
> so the problem might be that the data structure with pending commits isn't 
> cleaned up properly. A hypothetical sketch of that pattern follows.
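> 
> To illustrate the suspected pattern (a hypothetical Java sketch, not Kafka's 
> actual Scala code): if pending commits are recorded for the unfiltered set of 
> partitions but only the filtered ones are cleaned up afterwards, the 
> oversized entries stay pending forever and every offset fetch keeps seeing 
> unstable commits:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> import java.util.Set;
> 
> // Hypothetical model of the suspected mismatch in GroupMetadataManager.
> class PendingCommits {
>     private final Map<Integer, Long> pending = new HashMap<>();
> 
>     // Called before appending the commit: uses the UNFILTERED map,
>     // so oversized entries get registered as pending too.
>     void prepare(Map<Integer, Long> offsetMetadata) {
>         pending.putAll(offsetMetadata);
>     }
> 
>     // Called after the append completes: uses only the FILTERED keys,
>     // so the oversized entries are never removed.
>     void complete(Set<Integer> filteredPartitions) {
>         filteredPartitions.forEach(pending::remove);
>     }
> 
>     // An offset fetch is blocked while anything is still pending.
>     boolean hasPendingCommits() {
>         return !pending.isEmpty();
>     }
> }
> {code}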



