[jira] [Comment Edited] (KAFKA-20000) Optimize retry backoff for CONCURRENT_TRANSACTIONS to improve TV2 throughput

sanghyeok An (Jira) Fri, 02 Jan 2026 19:18:07 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-20000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048806#comment-18048806
 ]


sanghyeok An edited comment on KAFKA-20000 at 1/3/26 3:17 AM:
--------------------------------------------------------------

[~jolshan] [~chia7712] [~francisgodinho] 

Jumping in here to share my thoughts.

I think it would be better to leave the client implementation as is and fix 
this on the server side.

Rather than modifying the client, we should retry {{CONCURRENT_TRANSACTIONS}} 
on the server (specifically, within the Group Coordinator), similar to how the 
broker already handles the produce path in TV2. As far as I can tell, the Group 
Coordinator currently sends {{AddPartitionsToTxn}} requests to the Transaction 
Coordinator but does not retry internally when it gets 
{{CONCURRENT_TRANSACTIONS}} back.

If we handle this on the server:
 # We can reuse the existing {{add.partitions.to.txn.retry.backoff.ms}} config 
for {{TxnOffsetCommit}}, as intended.

 # It aligns with the TV2 design goal of keeping clients thin. Other Kafka 
clients (Go, Rust, Python, etc.) become much simpler to implement, since they 
won't need their own backoff logic for this specific error.
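
To make the idea more concrete, here is a rough sketch of the kind of retry 
loop I have in mind. It is not taken from the Kafka code base: the class, the 
enum, and the {{verifyWithRetry}} helper are all hypothetical names, and 
{{backoffMs}} simply stands in for the value of 
{{add.partitions.to.txn.retry.backoff.ms}}. A real implementation in the Group 
Coordinator would presumably reschedule the verification on a timer rather than 
block a thread with {{Thread.sleep}}.

```java
import java.util.function.Supplier;

/** Hypothetical sketch only; none of these names are real Kafka classes. */
class ConcurrentTxnRetrySketch {

    /** Stand-in for the small subset of error codes this sketch cares about. */
    enum Error { NONE, CONCURRENT_TRANSACTIONS, INVALID_TXN_STATE }

    /**
     * Calls verifyOnce (a stand-in for one AddPartitionsToTxn verification
     * round trip) and retries while it returns CONCURRENT_TRANSACTIONS.
     * Any other result is returned immediately. Retrying stops once another
     * backoff would exceed maxWaitMs, and the last result is returned.
     */
    static Error verifyWithRetry(Supplier<Error> verifyOnce, long backoffMs, long maxWaitMs) {
        long waitedMs = 0;
        while (true) {
            Error result = verifyOnce.get();
            boolean budgetExhausted = waitedMs + backoffMs > maxWaitMs;
            if (result != Error.CONCURRENT_TRANSACTIONS || budgetExhausted) {
                return result;
            }
            try {
                // Assumption: a blocking sleep keeps the sketch short; a real
                // coordinator would schedule the retry asynchronously.
                Thread.sleep(backoffMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return result;
            }
            waitedMs += backoffMs;
        }
    }
}
```

With something like this on the server, the client would only see 
{{CONCURRENT_TRANSACTIONS}} for {{TxnOffsetCommit}} if the transient state 
outlasts the wait budget.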

 

What do you think?



> Optimize retry backoff for CONCURRENT_TRANSACTIONS to improve TV2 throughput
> ----------------------------------------------------------------------------
>
>                 Key: KAFKA-20000
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20000
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Chia-Ping Tsai
>            Assignee: Francis Godinho
>            Priority: Major
>             Fix For: 4.3.0
>
>
> Transaction V2 introduces frequent state transitions (epoch bumps) that 
> briefly reject concurrent requests with CONCURRENT_TRANSACTIONS. The default 
> client retry backoff (100ms) is excessive for these transient locks, leading 
> to unnecessary latency and degraded throughput. Reducing the backoff allows 
> faster retries and smoother performance during state transitions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
