[
https://issues.apache.org/jira/browse/RATIS-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shilun Fan updated RATIS-2408:
------------------------------
Description:
*Problem*
Currently, the Netty DataStream client uses a fixed 100ms delay for
reconnection attempts when the connection fails. This approach has several
limitations:
1. *Resource waste*: During network issues or server unavailability,
constant 100ms retry intervals create unnecessary load
2. *Thundering herd*: Multiple clients reconnecting simultaneously can
overwhelm the server
3. *Lack of configurability*: Users cannot tune reconnection behavior for
their specific use cases
*Solution*
Implement configurable exponential backoff with jitter for DataStream client
reconnections:
1. *Configuration Support*:
- `raft.client.datastream.reconnect.delay` - Initial reconnection delay
(default: 100ms)
- `raft.client.datastream.reconnect.max-delay` - Maximum backoff delay
(default: 5s)
2. *Exponential Backoff*:
- Delay doubles on each failed attempt until it reaches the maximum; with the
defaults: 100ms → 200ms → 400ms → 800ms → 1600ms → 3200ms → 5000ms (capped)
- Resets to initial delay upon successful connection
3. *Jitter (0.5x-1.5x)*:
- Randomizes actual delay to avoid synchronized reconnection storms
- Example: 1000ms base → actual delay between 500ms and 1500ms
4. *Concurrent Safety*:
- Prevents duplicate reconnection scheduling using atomic flags
- Ensures cleanup even if reconnection is short-circuited
5. *Adaptive Logging*:
- INFO level for short delays (≤500ms) - normal reconnection
- WARN level for long delays (>500ms) - persistent failures
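
For illustration only, the backoff and jitter rules above could be sketched roughly as follows. This is not the Ratis implementation: `ReconnectBackoffSketch` and its methods are hypothetical names, and the sketch assumes the cap applies to the base delay before jitter, which the description does not specify.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not a Ratis class.
final class ReconnectBackoffSketch {
  private final long initialDelayMs;
  private final long maxDelayMs;
  private long baseDelayMs;

  ReconnectBackoffSketch(long initialDelayMs, long maxDelayMs) {
    if (initialDelayMs <= 0 || maxDelayMs < initialDelayMs) {
      throw new IllegalArgumentException("require 0 < initialDelayMs <= maxDelayMs");
    }
    this.initialDelayMs = initialDelayMs;
    this.maxDelayMs = maxDelayMs;
    this.baseDelayMs = initialDelayMs;
  }

  /** Returns the base delay for this failed attempt, then doubles it (capped) for the next one. */
  synchronized long nextBaseDelayMs() {
    final long current = baseDelayMs;
    // Compare against max/2 first so that doubling can never overflow.
    baseDelayMs = current > maxDelayMs / 2 ? maxDelayMs : current * 2;
    return current;
  }

  /** Called after a successful connection: the next failure starts from the initial delay again. */
  synchronized void reset() {
    baseDelayMs = initialDelayMs;
  }

  /** Scales the base delay by a random factor in [0.5, 1.5). */
  static long withJitter(long baseMs) {
    final double factor = 0.5 + ThreadLocalRandom.current().nextDouble();
    return (long) (baseMs * factor);
  }

  public static void main(String[] args) {
    // Values mirror the defaults named in the description (100ms initial, 5s max).
    final ReconnectBackoffSketch backoff = new ReconnectBackoffSketch(100, 5000);
    for (int attempt = 1; attempt <= 8; attempt++) {
      final long base = backoff.nextBaseDelayMs();
      System.out.println("attempt " + attempt + ": base=" + base
          + "ms, jittered=" + withJitter(base) + "ms");
    }
  }
}
```

Under this assumption the jittered delay can exceed the configured maximum by up to 50%; an implementation that must never wait longer than the maximum would clamp after applying jitter instead.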
was:
## Problem
Currently, the Netty DataStream client uses a fixed 100ms delay for
reconnection attempts when the connection fails. This approach has several
limitations:
1. **Resource waste**: During network issues or server unavailability, constant
100ms retry intervals create unnecessary load
2. **Thundering herd**: Multiple clients reconnecting simultaneously can
overwhelm the server
3. **Lack of configurability**: Users cannot tune reconnection behavior for
their specific use cases
## Solution
Implement configurable exponential backoff with jitter for DataStream client
reconnections:
1. **Configuration Support**:
- `raft.client.datastream.reconnect.delay` - Initial reconnection delay
(default: 100ms)
- `raft.client.datastream.reconnect.max-delay` - Maximum backoff delay
(default: 5s)
2. **Exponential Backoff**:
- Delay doubles on each failed attempt: 100ms → 200ms → 400ms → 800ms → 1600ms
→ 5000ms
- Resets to initial delay upon successful connection
3. **Jitter (0.5x-1.5x)**:
- Randomizes actual delay to avoid synchronized reconnection storms
- Example: 1000ms base → actual delay between 500ms-1500ms
4. **Concurrent Safety**:
- Prevents duplicate reconnection scheduling using atomic flags
- Ensures cleanup even if reconnection is short-circuited
5. **Adaptive Logging**:
- INFO level for short delays (≤500ms) - normal reconnection
- WARN level for long delays (>500ms) - persistent failures
> Add configurable exponential backoff reconnection for Netty DataStream client
> -----------------------------------------------------------------------------
>
> Key: RATIS-2408
> URL: https://issues.apache.org/jira/browse/RATIS-2408
> Project: Ratis
> Issue Type: Improvement
> Components: Netty
> Reporter: Shilun Fan
> Assignee: Shilun Fan
> Priority: Major
>
> *Problem*
>
> Currently, the Netty DataStream client uses a fixed 100ms delay for
> reconnection attempts when the connection fails. This approach has several
> limitations:
> 1. *Resource waste*: During network issues or server unavailability,
> constant 100ms retry intervals create unnecessary load
> 2. *Thundering herd*: Multiple clients reconnecting simultaneously can
> overwhelm the server
> 3. *Lack of configurability*: Users cannot tune reconnection behavior
> for their specific use cases
>
>
> *Solution*
> Implement configurable exponential backoff with jitter for DataStream client
> reconnections:
> 1. *Configuration Support*:
> - `raft.client.datastream.reconnect.delay` - Initial reconnection delay
> (default: 100ms)
> - `raft.client.datastream.reconnect.max-delay` - Maximum backoff delay
> (default: 5s)
> 2. *Exponential Backoff*:
> - Delay doubles on each failed attempt until it reaches the maximum; with the
> defaults: 100ms → 200ms → 400ms → 800ms → 1600ms → 3200ms → 5000ms (capped)
> - Resets to initial delay upon successful connection
> 3. *Jitter (0.5x-1.5x)*:
> - Randomizes actual delay to avoid synchronized reconnection storms
> - Example: 1000ms base → actual delay between 500ms and 1500ms
> 4. *Concurrent Safety*:
> - Prevents duplicate reconnection scheduling using atomic flags
> - Ensures cleanup even if reconnection is short-circuited
> 5. *Adaptive Logging*:
> - INFO level for short delays (≤500ms) - normal reconnection
> - WARN level for long delays (>500ms) - persistent failures
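
As a rough sketch of the Concurrent Safety and Adaptive Logging items, an atomic flag can reject duplicate scheduling requests, and a `finally` block can clear the flag even when the reconnect attempt returns early or throws. All names here (`ReconnectGuardSketch`, `scheduleReconnect`, `logLevelFor`) are hypothetical and not taken from the Ratis code; `System.out` stands in for a real logger.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; class and method names are not taken from Ratis.
final class ReconnectGuardSketch {
  private final AtomicBoolean reconnectScheduled = new AtomicBoolean(false);
  private final ScheduledExecutorService executor;
  private final Runnable reconnectAction;

  ReconnectGuardSketch(ScheduledExecutorService executor, Runnable reconnectAction) {
    this.executor = executor;
    this.reconnectAction = reconnectAction;
  }

  /** Returns true if this call scheduled a reconnect, false if one was already pending. */
  boolean scheduleReconnect(long delayMs) {
    // Only the caller that flips the flag from false to true gets to schedule.
    if (!reconnectScheduled.compareAndSet(false, true)) {
      return false;
    }
    System.out.println(logLevelFor(delayMs) + ": reconnecting in " + delayMs + "ms");
    try {
      executor.schedule(() -> {
        try {
          reconnectAction.run();
        } finally {
          // Clear the flag even if the attempt is short-circuited or throws.
          reconnectScheduled.set(false);
        }
      }, delayMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (RuntimeException e) {
      // Scheduling itself failed (for example, the executor was shut down): undo the flag.
      reconnectScheduled.set(false);
      throw e;
    }
  }

  /** Threshold from the description: delays up to 500ms log at INFO, longer ones at WARN. */
  static String logLevelFor(long delayMs) {
    return delayMs <= 500 ? "INFO" : "WARN";
  }

  public static void main(String[] args) {
    final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    final ReconnectGuardSketch guard =
        new ReconnectGuardSketch(executor, () -> System.out.println("reconnect attempt"));
    System.out.println("first call scheduled:  " + guard.scheduleReconnect(200));
    System.out.println("second call scheduled: " + guard.scheduleReconnect(200));
    // By default, an already-scheduled delayed task still runs after shutdown().
    executor.shutdown();
  }
}
```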
--
This message was sent by Atlassian Jira
(v8.20.10#820010)