AlinsRan opened a new pull request, #13495:
URL: https://github.com/apache/apisix/pull/13495

   ### Description
   
   This PR adds two optional controls to the `ai-proxy-multi` fallback/retry 
mechanism, so that retry behavior better matches the cost and latency 
characteristics of AI requests.
   
   When multiple LLM instances are configured (multi-provider / multi-region / 
multi-key for failover), `fallback_strategy` lets the plugin switch to the next 
instance on `429`/`5xx`/connection errors. However, AI requests are usually 
long-running, so two problems arise:
   
   1. **Slow failures double the client wait.** If an upstream takes minutes 
before returning a `5xx`, switching to another instance and re-running the 
request makes the client wait even longer and wastes provider quota / gateway 
resources.
   2. **Many instances can be exhausted in a single request.** With several 
instances, one request may try every one of them, so worst-case latency grows 
with the instance count and has no per-request cap.
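
To put a rough number on the second problem, here is a back-of-the-envelope
sketch in Python. It assumes instances are tried one after another and that
each attempt may run for the full `timeout`. The helper name
`worst_case_wait_ms` and the instance count of 4 are illustrative assumptions,
not part of the plugin.

```python
def worst_case_wait_ms(timeout_ms: int, attempts: int) -> int:
    """Upper bound on client wait if every attempt runs to the full timeout.

    Assumes attempts are strictly sequential and ignores gateway overhead.
    """
    return timeout_ms * attempts


# Hypothetical setup: 4 instances, 300 s timeout, every instance is tried.
all_instances = worst_case_wait_ms(300_000, attempts=4)
# Same setup, but only the initial attempt plus one retry is allowed.
two_attempts = worst_case_wait_ms(300_000, attempts=1 + 1)
print(all_instances, two_attempts)
```

Under these assumptions the bound scales linearly with the number of attempts,
which is why limiting the attempts per request matters for long-running calls.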
   
   This PR introduces:
   
   - **`max_retries`** (integer, optional): maximum number of fallback retries 
after the initial request fails. Bounds how many additional instances a single 
request tries.
   - **`retry_on_failure_within_ms`** (integer, optional): only fall back when 
the upstream fails within this many milliseconds. Fast failures (connection 
errors, quick `429`/`5xx`) are retried; slow failures that take longer are 
returned to the client directly to avoid doubling the wait time.
   
   Both are optional and only take effect together with `fallback_strategy`. 
When unset, behavior is identical to today, so existing users are unaffected.
   
   #### Example
   
   ```json
   {
     "ai-proxy-multi": {
       "fallback_strategy": ["http_429", "http_5xx"],
       "max_retries": 1,
       "retry_on_failure_within_ms": 5000,
       "timeout": 300000,
       "instances": [ ... ]
     }
   }
   ```
   
   - Only failures that occur within `5000` ms trigger fallback/retry.
   - `max_retries: 1` allows at most one retry beyond the initial request.
   - A `5xx` that arrives after 5 minutes is passed to the client without 
retrying.
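
The three rules above can be modeled with a small Python simulation. This is
only a sketch of the described behavior, not the plugin's Lua implementation:
the names `serve` and `is_retryable` are hypothetical, connection errors are
not modeled, and treating a failure at exactly the threshold as "within" is an
assumption.

```python
from typing import Optional, Sequence, Tuple

# (HTTP status, elapsed milliseconds) for one upstream attempt.
Attempt = Tuple[int, int]


def is_retryable(status: int) -> bool:
    """Models fallback_strategy ["http_429", "http_5xx"] from the example."""
    return status == 429 or 500 <= status <= 599


def serve(
    attempts: Sequence[Attempt],
    max_retries: Optional[int] = None,
    retry_on_failure_within_ms: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (index, status) of the attempt whose response reaches the client.

    `attempts` lists what each instance would answer, in fallback order.
    None for either option means "no limit" (the behavior when unset).
    """
    for i, (status, elapsed_ms) in enumerate(attempts):
        last_instance = i == len(attempts) - 1
        if not is_retryable(status) or last_instance:
            return i, status
        # When attempt i fails, i retries have already been spent.
        if max_retries is not None and i >= max_retries:
            return i, status
        # Slow failure: hand it to the client instead of retrying.
        # Assumption: "within" includes the threshold itself.
        too_slow = (
            retry_on_failure_within_ms is not None
            and elapsed_ms > retry_on_failure_within_ms
        )
        if too_slow:
            return i, status
    raise ValueError("attempts must not be empty")


cfg = {"max_retries": 1, "retry_on_failure_within_ms": 5000}
fast_failure = serve([(503, 120), (200, 900)], **cfg)
retries_exhausted = serve([(429, 50), (502, 80), (200, 700)], **cfg)
slow_failure = serve([(500, 300_000), (200, 900)], **cfg)
options_unset = serve([(500, 300_000), (200, 900)])
```

In this model a fast `503` falls back to the second instance, a second fast
failure is returned because the single retry is already used, and a `500` that
takes 300000 ms is returned immediately. With both options unset, the same slow
`500` still falls back to the next instance.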
   
   #### Which issue(s) this PR fixes:
   
   N/A (internal ticket)
   
   ### Checklist
   
   - [x] I have explained the need for this PR and the problem it solves
   - [x] I have explained the changes or the new features added to this PR
   - [x] I have added tests corresponding to this change
   - [x] I have updated the documentation to reflect this change
   - [x] I have verified that this change is backward compatible
   

