Said BOUDJELDA created KAFKA-20113:
--------------------------------------

             Summary: Add Configurable Retry Parameters for Status Backing Store
                 Key: KAFKA-20113
                 URL: https://issues.apache.org/jira/browse/KAFKA-20113
             Project: Kafka
          Issue Type: New Feature
          Components: connect
            Reporter: Said BOUDJELDA
            Assignee: Said BOUDJELDA
             Fix For: 4.2.0


Implement configurable retry parameters for the +KafkaStatusBackingStore+
to address the TODO comment "retry more gracefully and not forever" and provide 
operators with control over retry behavior during transient failures.
h3. Problem Statement
 
KafkaStatusBackingStore currently retries status updates indefinitely when 
encountering retriable exceptions. This behavior is problematic because:
 # *Infinite retry loops* can cause the worker to become unresponsive during 
extended Kafka broker outages
 # *No tunability* - operators cannot adjust retry parameters to suit their 
environment
 # *Resource exhaustion* - indefinite retries can consume threads and memory 
during prolonged failures
 # *No graceful degradation* - the system continues retrying without bound 
rather than failing fast when appropriate

A TODO comment in the codebase ({{// TODO: retry more gracefully and not 
forever}}) explicitly acknowledges that this issue needs addressing.
h3. Proposed Solution

Add four new configuration properties under the {{status.storage.}} prefix to 
control retry behavior:

 
||Property||Type||Default||Description||
|{{status.storage.retry.max.retries}}|INT|5|Maximum number of retry attempts before giving up|
|{{status.storage.retry.initial.backoff.ms}}|LONG|300|Initial backoff delay in milliseconds|
|{{status.storage.retry.max.backoff.ms}}|LONG|10000|Maximum backoff delay cap in milliseconds|
|{{status.storage.retry.backoff.multiplier}}|DOUBLE|2.0|Multiplier applied to backoff after each attempt|
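
As a rough illustration of how these four properties and their defaults could be read from a worker's configuration, here is a minimal plain-Java sketch. The class name {{StatusStoreRetryConfig}} and the validation rules are hypothetical and not part of this proposal; an actual implementation would presumably go through Connect's existing configuration machinery instead.

```java
import java.util.Map;

public class StatusStoreRetryConfig {
    // Property names and defaults are taken from the table in this proposal.
    public static final String MAX_RETRIES = "status.storage.retry.max.retries";
    public static final String INITIAL_BACKOFF_MS = "status.storage.retry.initial.backoff.ms";
    public static final String MAX_BACKOFF_MS = "status.storage.retry.max.backoff.ms";
    public static final String BACKOFF_MULTIPLIER = "status.storage.retry.backoff.multiplier";

    public final int maxRetries;
    public final long initialBackoffMs;
    public final long maxBackoffMs;
    public final double backoffMultiplier;

    public StatusStoreRetryConfig(Map<String, String> props) {
        // Fall back to the proposed default when a property is absent.
        maxRetries = Integer.parseInt(props.getOrDefault(MAX_RETRIES, "5"));
        initialBackoffMs = Long.parseLong(props.getOrDefault(INITIAL_BACKOFF_MS, "300"));
        maxBackoffMs = Long.parseLong(props.getOrDefault(MAX_BACKOFF_MS, "10000"));
        backoffMultiplier = Double.parseDouble(props.getOrDefault(BACKOFF_MULTIPLIER, "2.0"));

        // Assumed validation rules (hypothetical, not specified in the proposal).
        if (maxRetries < 0 || initialBackoffMs < 0
                || maxBackoffMs < initialBackoffMs || backoffMultiplier < 1.0) {
            throw new IllegalArgumentException("Invalid status store retry configuration");
        }
    }
}
```

For example, a worker that should tolerate longer outages might set {{status.storage.retry.max.retries}} to a higher value while leaving the other three properties at their defaults.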

The retry mechanism uses *exponential backoff with jitter* to prevent 
thundering herd problems during cluster recovery.
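
To make the backoff computation concrete, the following sketch shows one possible way to derive the delay for a given retry attempt. The class and method names are hypothetical, the "full jitter" strategy (a uniform random delay between 0 and the capped value) is an assumption, and so is the convention that the first retry uses the initial backoff; the proposal does not say which jitter variant is intended.

```java
import java.util.concurrent.ThreadLocalRandom;

public class BackoffSketch {
    // Un-jittered delay before the given retry (attempt 0 = first retry),
    // growing geometrically and capped at maxMs.
    public static long cappedDelayMs(int attempt, long initialMs, long maxMs, double multiplier) {
        double raw = initialMs * Math.pow(multiplier, attempt);
        return (long) Math.min(raw, (double) maxMs);
    }

    // Assumed "full jitter": pick a delay uniformly in [0, cappedDelay].
    public static long jitteredDelayMs(int attempt, long initialMs, long maxMs, double multiplier) {
        long cap = cappedDelayMs(attempt, initialMs, maxMs, multiplier);
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }
}
```

Under these assumptions and the proposed defaults, the un-jittered delays for the five retries would be 300, 600, 1200, 2400 and 4800 ms, so the 10000 ms cap would only come into play if {{status.storage.retry.max.retries}} were raised.
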
h4. Behavior
 * Retries occur only for exceptions marked as {{RetriableException}}
 * After exhausting {{status.storage.retry.max.retries}} attempts, the 
operation logs an error and terminates gracefully
 * All retry attempts are logged at WARN level with attempt count and delay 
information
 * Non-retriable exceptions fail immediately without retry
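
The behavior listed above could be sketched as a bounded retry loop along the following lines. This is only an illustration under stated assumptions: {{RetryLoopSketch}} and {{runWithRetries}} are hypothetical names, the nested {{RetriableException}} is a local stand-in for Kafka's {{org.apache.kafka.common.errors.RetriableException}}, logging is reduced to {{System.err}}, and jitter is left out to keep the control flow readable.

```java
public class RetryLoopSketch {
    // Local stand-in for Kafka's RetriableException, so the sketch compiles on its own.
    public static class RetriableException extends RuntimeException {
        public RetriableException(String message) { super(message); }
    }

    // Returns true if op succeeded, false if retries were exhausted or the thread was interrupted.
    public static boolean runWithRetries(Runnable op, int maxRetries,
                                         long initialMs, long maxMs, double multiplier) {
        long delayMs = initialMs;
        for (int attempt = 0; ; attempt++) {
            try {
                op.run();
                return true;
            } catch (RetriableException e) {
                if (attempt >= maxRetries) {
                    // Retries exhausted: log an error and give up instead of looping forever.
                    System.err.println("ERROR giving up after " + attempt + " retries: " + e.getMessage());
                    return false;
                }
                // Each retry is reported with its attempt count and delay.
                System.err.println("WARN retry " + (attempt + 1) + "/" + maxRetries + " in " + delayMs + " ms");
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
                delayMs = (long) Math.min(delayMs * multiplier, (double) maxMs);
            }
            // Any exception other than RetriableException is not caught and propagates immediately.
        }
    }
}
```

Returning a boolean rather than rethrowing is one possible reading of "terminates gracefully"; an implementation could equally surface the final exception to the caller.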

h3. Benefits
 # *Predictable failure modes* - Workers eventually give up and surface errors 
instead of hanging
 # *Operator control* - Tune retry behavior based on environment characteristics
 # *Better observability* - Clear logging of retry attempts and outcomes
 # *Backward compatible* - With the default values, short transient failures 
are still retried as in the current implementation; only prolonged failures 
are now bounded

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
