HaoTien opened a new issue, #12763:
URL: https://github.com/apache/apisix/issues/12763
### Description
# feat: As a user, I want error ratio-based circuit breaking in api-breaker plugin, so that I can have more intelligent circuit breaking based on error rates instead of just failure counts
## Description
Currently, the `api-breaker` plugin supports only failure count-based
circuit breaking (the `unhealthy-count` policy), which trips the circuit
breaker when the number of consecutive failures reaches a threshold. This
approach does not suit every scenario, especially when traffic patterns vary.
I propose adding an error ratio-based circuit breaking policy
(`unhealthy-ratio`) that trips the circuit breaker when the error rate within
a sliding time window reaches a threshold, so that circuit breaking adapts to
the actual traffic volume.
## Motivation
### Current Limitations
- The existing failure count-based approach considers only consecutive failures
- It does not account for the error rate relative to total requests
- It may be too sensitive during low-traffic periods or not sensitive enough during high-traffic periods
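
To make the difference concrete, here is a minimal Python sketch (not APISIX code) that evaluates the same request outcomes under both ideas. The helper names `trips_on_consecutive` and `error_ratio` are hypothetical, and the consecutive-failure threshold of 3 is chosen only for illustration.

```python
from typing import Sequence


def trips_on_consecutive(outcomes: Sequence[bool], failures: int) -> bool:
    """Return True once `failures` errors occur in a row (count-based idea)."""
    streak = 0
    for ok in outcomes:
        # A success resets the streak; an error extends it.
        streak = 0 if ok else streak + 1
        if streak >= failures:
            return True
    return False


def error_ratio(outcomes: Sequence[bool]) -> float:
    """Fraction of failed requests; 0.0 for an empty sequence."""
    if not outcomes:
        return 0.0
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


# Alternating success/failure: half of all requests fail,
# but there are never 3 failures in a row.
alternating = [True, False] * 10
count_tripped = trips_on_consecutive(alternating, failures=3)
ratio = error_ratio(alternating)
```

In this trace the count-based check never trips even though half of the requests fail, which is the situation a ratio-based policy is meant to catch.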
### Benefits of Error Ratio-based Circuit Breaking
- More accurate representation of service health by considering error rate
rather than just failure count
- Better handling of varying traffic patterns
- Configurable sliding time window for flexible error rate calculation
- Support for circuit breaker states: CLOSED, OPEN, and HALF_OPEN
## Proposed Solution
Add a new `policy` parameter to the `api-breaker` plugin with two options:
- `unhealthy-count` (default, existing behavior)
- `unhealthy-ratio` (new error ratio-based policy)
### New Configuration Parameters for `unhealthy-ratio` Policy
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `policy` | string | `"unhealthy-count"` | Circuit breaker policy |
| `unhealthy.error_ratio` | number | `0.5` | Error rate threshold (0-1) to trigger circuit breaker |
| `unhealthy.min_request_threshold` | integer | `10` | Minimum requests needed before evaluating error rate |
| `unhealthy.sliding_window_size` | integer | `300` | Sliding window size in seconds for error rate calculation |
| `unhealthy.permitted_number_of_calls_in_half_open_state` | integer | `3` | Number of permitted calls in half-open state |
| `healthy.success_ratio` | number | `0.6` | Success rate threshold to close circuit breaker from half-open state |
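
As a rough illustration of how these defaults and ranges might be checked, the following Python sketch models the ratio-specific parameters as a dataclass. It is not the plugin's actual schema: the class name `UnhealthyRatioConfig` is hypothetical, and every bound other than the stated 0-1 range for `error_ratio` is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnhealthyRatioConfig:
    """Hypothetical container for the proposed parameters, with the proposed defaults."""

    error_ratio: float = 0.5
    min_request_threshold: int = 10
    sliding_window_size: int = 300  # seconds
    permitted_number_of_calls_in_half_open_state: int = 3
    success_ratio: float = 0.6

    def __post_init__(self) -> None:
        # The 0-1 range for error_ratio comes from the table above.
        if not 0 <= self.error_ratio <= 1:
            raise ValueError("error_ratio must be within [0, 1]")
        # Assumption: success_ratio is also a ratio within [0, 1].
        if not 0 <= self.success_ratio <= 1:
            raise ValueError("success_ratio must be within [0, 1]")
        # Assumption: the integer parameters must be at least 1.
        for name in (
            "min_request_threshold",
            "sliding_window_size",
            "permitted_number_of_calls_in_half_open_state",
        ):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")


defaults = UnhealthyRatioConfig()
```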
### Example Configuration
```json
{
  "plugins": {
    "api-breaker": {
      "break_response_code": 503,
      "policy": "unhealthy-ratio",
      "max_breaker_sec": 60,
      "unhealthy": {
        "http_statuses": [500, 502, 503, 504],
        "error_ratio": 0.5,
        "min_request_threshold": 10,
        "sliding_window_size": 300,
        "permitted_number_of_calls_in_half_open_state": 3
      },
      "healthy": {
        "http_statuses": [200, 201, 202],
        "success_ratio": 0.6
      }
    }
  }
}
```
## Implementation Details
### Circuit Breaker States
- **CLOSED**: Requests are forwarded as normal
- **OPEN**: Requests receive the circuit breaker response (`break_response_code`) immediately and are not forwarded
- **HALF_OPEN**: A limited number of requests are forwarded to test whether the service has recovered
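
The three states and the transitions implied by this proposal can be sketched as a small Python model. The names `State`, `ALLOWED_TRANSITIONS`, `may_forward`, and `can_transition` are hypothetical and do not come from the plugin source.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# Transitions implied by the proposal: CLOSED trips to OPEN, OPEN moves to
# HALF_OPEN after the break period, and HALF_OPEN either recovers or reopens.
ALLOWED_TRANSITIONS = {
    State.CLOSED: {State.OPEN},
    State.OPEN: {State.HALF_OPEN},
    State.HALF_OPEN: {State.CLOSED, State.OPEN},
}


def may_forward(state: State) -> bool:
    """Only OPEN rejects every request; HALF_OPEN forwards a limited number."""
    return state is not State.OPEN


def can_transition(src: State, dst: State) -> bool:
    """Check a transition against the table above."""
    return dst in ALLOWED_TRANSITIONS[src]
```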
### Algorithm
1. Track requests and errors within a sliding time window of `sliding_window_size` seconds
2. When the request count ≥ `min_request_threshold` and the error rate ≥ `error_ratio`, open the circuit breaker
3. After `max_breaker_sec` seconds, transition to the half-open state
4. In the half-open state, allow up to `permitted_number_of_calls_in_half_open_state` requests
5. If the success rate of those requests reaches `healthy.success_ratio`, close the circuit breaker; otherwise, reopen it
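
The steps above can be sketched in Python as follows. This is a single-process illustration under stated assumptions, not the proposed Lua implementation: the class `RatioBreaker` and its methods `allow` and `record` are hypothetical, timestamps are passed in explicitly so the logic is deterministic, the half-open decision is taken once all permitted calls have reported, and the window is cleared on every state change.

```python
from collections import deque


class RatioBreaker:
    """Hypothetical sliding-window, error-ratio circuit breaker."""

    def __init__(self, error_ratio=0.5, min_request_threshold=10,
                 sliding_window_size=300, max_breaker_sec=60,
                 half_open_calls=3, success_ratio=0.6):
        self.error_ratio = error_ratio
        self.min_request_threshold = min_request_threshold
        self.window = sliding_window_size
        self.max_breaker_sec = max_breaker_sec
        self.half_open_calls = half_open_calls
        self.success_ratio = success_ratio
        self.state = "CLOSED"
        self.events = deque()        # (timestamp, is_error) pairs in the window
        self.opened_at = 0.0
        self.half_open_admitted = 0
        self.half_open_results = []  # True for each successful trial call

    def allow(self, now: float) -> bool:
        """Decide whether a request arriving at `now` may be forwarded."""
        if self.state == "OPEN":
            if now - self.opened_at < self.max_breaker_sec:
                return False
            # Step 3: the break period has elapsed, start probing.
            self.state = "HALF_OPEN"
            self.half_open_admitted = 0
            self.half_open_results = []
        if self.state == "HALF_OPEN":
            # Step 4: admit only a limited number of trial calls.
            if self.half_open_admitted >= self.half_open_calls:
                return False
            self.half_open_admitted += 1
        return True

    def record(self, now: float, is_error: bool) -> None:
        """Record the outcome of a forwarded request."""
        if self.state == "OPEN":
            return  # Assumption: late responses while open are ignored.
        if self.state == "HALF_OPEN":
            self.half_open_results.append(not is_error)
            if len(self.half_open_results) == self.half_open_calls:
                # Step 5: close on a sufficient success rate, else reopen.
                rate = sum(self.half_open_results) / self.half_open_calls
                if rate >= self.success_ratio:
                    self._close()
                else:
                    self._open(now)
            return
        # Step 1: track the outcome and drop events older than the window.
        self.events.append((now, is_error))
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        # Step 2: trip only with enough samples and a high enough error rate.
        total = len(self.events)
        errors = sum(1 for _, err in self.events if err)
        if total >= self.min_request_threshold and errors / total >= self.error_ratio:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "OPEN"
        self.opened_at = now
        self.events.clear()

    def _close(self) -> None:
        self.state = "CLOSED"
        self.events.clear()
```

A caller would invoke `allow(now)` before forwarding each request and `record(now, is_error)` once the response status is known. A real implementation in APISIX would also need state shared across workers, which this sketch does not address.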
## Backward Compatibility
This enhancement is fully backward compatible:
- Existing configurations continue to work without changes
- Default `policy` is `"unhealthy-count"` (existing behavior)
- No breaking changes to existing APIs
## Testing
Comprehensive test coverage will be provided including:
- Schema validation tests for new parameters
- Functional tests for error ratio calculation
- Circuit breaker state transition tests
- Integration tests with various traffic patterns
- Backward compatibility tests
## Use Cases
1. **High-traffic services**: Better handling of error spikes in high-volume
scenarios
2. **Variable traffic patterns**: Adaptive behavior for services with
fluctuating request rates
3. **Microservices architectures**: More precise circuit breaking for
service mesh environments
4. **SLA-based circuit breaking**: Configure circuit breaker based on
acceptable error rates
## Files to be Modified
- `apisix/plugins/api-breaker.lua` - Core plugin logic
- `t/plugin/api-breaker.t` - Test cases (new test file for ratio-based tests)
- `docs/en/latest/plugins/api-breaker.md` - English documentation
- `docs/zh/latest/plugins/api-breaker.md` - Chinese documentation
## Additional Information
This feature has been implemented and tested locally. I'm ready to submit a
PR with:
- Complete implementation of the error ratio-based circuit breaking
- Comprehensive test suite following APISIX testing standards
- Updated documentation in both English and Chinese
- Backward compatibility preservation
I would appreciate feedback on this proposal and guidance on the contribution
process.