[PR] feat(ai-proxy): add max_stream_duration_ms and max_response_bytes safeguards [apisix]

via GitHub Fri, 17 Apr 2026 07:48:55 -0700


nic-6443 opened a new pull request, #13250:
URL: https://github.com/apache/apisix/pull/13250


   ### What this does
   
   Adds two opt-in configuration knobs to `ai-proxy` and `ai-proxy-multi` to 
protect the gateway from a runaway upstream LLM service:
   
   - `max_stream_duration_ms` — wall-clock cap on total streaming response 
duration.
   - `max_response_bytes` — cap on total bytes read from the upstream for a 
single response (streaming or non-streaming).
   
   Neither field has a default, so existing deployments are unaffected.
   
   ### Why
   
   The existing `timeout` field is fed to `httpc:set_timeout()`, which is a 
per-socket-operation timeout (connect / send / read-one-block). It does **not** 
bound the total duration of a streaming response. If an upstream LLM has a bug 
that causes it to continuously emit valid SSE tokens without ever sending a 
terminator (`[DONE]`, `message_stop`, `response.completed`), 
`parse_streaming_response` sits in an uncapped `while true` loop, pinning the 
worker at ~100% CPU indefinitely and degrading availability for all other 
traffic on that worker.
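   
   The difference between a per-operation timeout and a wall-clock cap can be 
sketched with a simulated clock. This is an illustration in Python, not the 
plugin's Lua code: `runaway_upstream`, `consume`, and `safety_stop_chunks` are 
hypothetical names, and the `>=` comparison is an assumption.

```python
from typing import Iterator, Optional, Tuple


def runaway_upstream(gap_ms: int) -> Iterator[Tuple[int, bytes]]:
    """Hypothetical upstream that never sends a terminator.

    Yields (time_spent_waiting_for_this_read_ms, chunk) forever.
    """
    while True:
        yield gap_ms, b'data: {"choices":[{"delta":{"content":"x"}}]}\n\n'


def consume(
    stream: Iterator[Tuple[int, bytes]],
    per_read_timeout_ms: int,
    max_stream_duration_ms: Optional[int] = None,
    safety_stop_chunks: int = 10_000,
) -> Tuple[str, int, int]:
    """Read chunks against a simulated clock; return (reason, chunks, elapsed_ms)."""
    elapsed_ms = 0
    chunks = 0
    for wait_ms, _chunk in stream:
        # Per-operation timeout: only this single read is compared to the limit.
        if wait_ms > per_read_timeout_ms:
            return "read_timeout", chunks, elapsed_ms
        elapsed_ms += wait_ms
        chunks += 1
        # Wall-clock cap: checked after each chunk, against the running total.
        if max_stream_duration_ms is not None and elapsed_ms >= max_stream_duration_ms:
            return "duration_cap", chunks, elapsed_ms
        # Present only so the uncapped demonstration terminates.
        if chunks >= safety_stop_chunks:
            return "still_running", chunks, elapsed_ms
    return "upstream_closed", chunks, elapsed_ms
```

   With chunks arriving every 10 ms and a 3000 ms per-read timeout, 
`consume(runaway_upstream(10), 3000)` never trips the timeout and only stops at 
the artificial safety limit, while adding `max_stream_duration_ms=500` ends the 
loop after 50 chunks.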
   
   ### Behavior on abort
   
   - **Streaming, limit hit mid-stream (bytes already flushed):** stop feeding 
chunks and force-close the upstream httpc (`close()` + `res._httpc = nil`, so 
we don't pool a half-drained connection). nginx closes the downstream 
connection at end of content phase. The client detects truncation via the 
missing protocol-specific terminator. We intentionally do **not** synthesize a 
per-protocol "graceful error" SSE frame: we support three client protocols 
(OpenAI chat, Anthropic messages, OpenAI responses) with different terminators, 
and a missing terminator is already how SSE clients learn of any mid-stream 
network failure.
   - **Streaming, limit hit before any output:** return `504` (duration) or 
`502` (size) so `on_error` / fallback / retry hooks can kick in like any other 
upstream failure.
   - **Non-streaming, `Content-Length` exceeds cap:** pre-check the header, 
force-close the connection, return `502` without ever reading the body.
   - **Non-streaming, chunked / no `Content-Length`:** post-read size check 
catches the oversized body and returns `502`.
   - `ctx.var.llm_request_done = true` is set on abort so downstream filters 
(e.g. moderation plugins that defer work until completion) finalize their state.
   - A `core.log.warn` line is emitted on every abort (`aborting AI stream: 
<limit> exceeded; bytes=X duration_ms=Y route_id=Z`) so log-based alerting can 
surface the event. No new Prometheus metric — the log line is sufficient and 
avoids expanding the plugin's metric surface.
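   
   The status-code rules above, and the client-side truncation check they rely 
on, can be summarized in a short Python sketch. It is illustrative only and is 
not the plugin's Lua code: `abort_status`, `stream_was_truncated`, and the 
protocol keys are hypothetical names, rejecting a duration limit on a 
non-streaming response is an assumption, and the substring match on the last 
SSE event is a simplification.

```python
from typing import Optional

# Terminators named in this PR, keyed by hypothetical protocol labels.
TERMINATORS = {
    "openai-chat": "[DONE]",
    "anthropic-messages": "message_stop",
    "openai-responses": "response.completed",
}


def abort_status(limit: str, streaming: bool, bytes_flushed: bool) -> Optional[int]:
    """Return the HTTP status sent on abort, or None when the stream is truncated."""
    if limit not in ("duration", "size"):
        raise ValueError("limit must be 'duration' or 'size'")
    if streaming:
        if bytes_flushed:
            # Headers are already sent: the stream just ends without a terminator.
            return None
        # Nothing sent yet: behave like any other upstream failure.
        return 504 if limit == "duration" else 502
    if limit == "duration":
        # Assumption: the duration cap applies to streaming responses only.
        raise ValueError("duration limit does not apply to non-streaming responses")
    # Content-Length pre-check and post-read check both end in 502.
    return 502


def stream_was_truncated(sse_body: str, protocol: str) -> bool:
    """Client-side check: did the last SSE event carry the protocol terminator?"""
    marker = TERMINATORS[protocol]
    events = [e for e in sse_body.split("\n\n") if e.strip()]
    return not events or marker not in events[-1]
```

   A client that receives `data: {"x":1}` followed by `data: [DONE]` treats the 
OpenAI chat stream as complete; the same body without the final event is 
reported as truncated.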
   
   ### Caveat (documented)
   
   Both limits are best-effort: they are enforced after each chunk is read from 
the upstream, so the byte cap can overshoot by up to one upstream chunk (≈8 KiB 
in practice) and the duration cap can overshoot by up to one chunk's processing 
time. This is acceptable for the failure mode we are defending against (runaway 
streams produce tens of MB/s, so a one-chunk overshoot is negligible compared 
to "run forever").
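   
   The byte-cap overshoot follows directly from checking the total only after 
a chunk has been read. A minimal Python sketch of that arithmetic, where 
`bytes_read_before_abort` is a hypothetical helper and the strict `>` 
comparison is an assumption:

```python
from typing import Iterable


def bytes_read_before_abort(chunk_sizes: Iterable[int], max_response_bytes: int) -> int:
    """Return how many bytes were read by the time the cap is noticed."""
    total = 0
    for size in chunk_sizes:
        # The whole chunk has already been read when the check runs.
        total += size
        if total > max_response_bytes:
            return total
    # The upstream finished without exceeding the cap.
    return total
```

   With 8 KiB (8192-byte) chunks and `max_response_bytes=2048`, the first check 
runs after 8192 bytes, an overshoot of 6144 bytes. The total before the final 
chunk is always within the cap, so the overshoot never exceeds one chunk.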
   
   ### Testing
   
   New `t/plugin/ai-proxy-stream-limits.t` with a mock upstream that either 
streams OpenAI chat SSE chunks forever (no `[DONE]`) or returns a 100 KB body 
with matching `Content-Length`. Covers:
   
   1. `max_stream_duration_ms=500` → request aborted in <5 s with the expected 
log line.
   2. `max_response_bytes=2048` → request aborted in <5 s with the expected log 
line.
   3. Non-streaming `max_response_bytes=1024` vs 100 KB upstream response → 502 
+ expected log line.
   4. Schema validation rejects `max_stream_duration_ms: 0`.
   
   `luacheck` passes on all three modified Lua files.
   
   ### Docs
   
   Added rows to the config tables in `docs/en/latest/plugins/ai-proxy.md`, 
`ai-proxy-multi.md`, and their Chinese translations, with a clarifying note 
that `timeout` only bounds individual socket operations and the new fields are 
needed to bound total stream duration / total bytes read.
   

