nic-6443 opened a new pull request, #13255:
URL: https://github.com/apache/apisix/pull/13255

   ## What
   
   Add an explicit `ngx.sleep(0)` at the end of each iteration of the streaming 
SSE loop in `apisix/plugins/ai-providers/base.lua::parse_streaming_response`. 
This guarantees the coroutine yields to the nginx scheduler at least once per 
upstream chunk.
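
A minimal sketch of where the yield sits in the loop. This is not the code in `base.lua`: `stream_chunks`, `on_chunk`, and `yield` are hypothetical names, and the yield is injected so the block also runs under plain Lua. Inside OpenResty it would fall back to `ngx.sleep(0)`.

```lua
-- Hypothetical sketch, not the actual parse_streaming_response.
-- body_reader: function returning the next chunk, or nil at end of stream
--              (on failure it returns nil plus an error string).
-- on_chunk:    function called with each chunk (e.g. parse SSE, send downstream).
-- yield:       function that gives up the CPU; assumed to default to
--              ngx.sleep(0) when running inside OpenResty.
local function stream_chunks(body_reader, on_chunk, yield)
    yield = yield or function() ngx.sleep(0) end
    local count = 0
    while true do
        local chunk, err = body_reader()
        if err then
            return nil, err
        end
        if not chunk then
            break
        end
        on_chunk(chunk)
        count = count + 1
        -- Yield once per chunk, even if neither the read nor the flush yielded.
        yield()
    end
    return count
end
```

Placing the yield at the end of the iteration means every processed chunk is followed by exactly one scheduling point, regardless of how the read and flush behaved.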
   
   ## Why
   
   In production we observed worker processes pinned at 100% CPU during AI 
proxy traffic. Root cause: when an upstream LLM emits SSE chunks in a tight 
burst (e.g. a model hallucinating and producing tokens at 100+ per second, or 
upstreams that batch multiple SSE events into a single TCP segment), the 
streaming loop runs for an extended period without yielding.
   
   Specifically:
   
   - `body_reader()` (cosocket `socket:receive()`) only yields when the recv 
buffer is empty. If the kernel has already buffered several chunks, successive 
calls return immediately without yielding.
   - `ngx.flush(true)` (used downstream) only yields when the send buffer is 
full. A fast downstream client drains immediately, so flush returns without 
yielding.
   
   Neither end of the loop guarantees a yield. The result: the SSE coroutine 
monopolizes the worker — starving health checks, concurrent requests on the 
same worker, and timer callbacks. Even modest traffic can saturate a single 
core because Lua coroutines on the same OpenResty worker share one OS thread.
   
   `ngx.sleep(0)` is the canonical OpenResty primitive for this — it queues a 
0-second timer and yields the current coroutine, letting the scheduler pick up 
any other ready coroutines, then resumes.
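
The starvation described above, and the effect of a per-iteration yield, can be illustrated with plain Lua coroutines. This is a toy round-robin scheduler, not the nginx event loop: `run_all`, `make_stream`, and `simulate` are hypothetical names, and `coroutine.yield()` stands in for `ngx.sleep(0)`.

```lua
-- Toy round-robin scheduler over plain Lua coroutines (illustration only).
local function run_all(tasks)
    local log, queue = {}, {}
    for _, task in ipairs(tasks) do
        queue[#queue + 1] = coroutine.create(function()
            task.fn(task.name, log)
        end)
    end
    while #queue > 0 do
        local co = table.remove(queue, 1)
        assert(coroutine.resume(co))
        if coroutine.status(co) ~= "dead" then
            queue[#queue + 1] = co  -- re-queue behind the other ready coroutines
        end
    end
    return table.concat(log, " ")
end

-- A streaming loop over `n` already-buffered chunks. `should_yield` stands in
-- for calling ngx.sleep(0) at the end of each iteration.
local function make_stream(n, should_yield)
    return function(name, log)
        for i = 1, n do
            log[#log + 1] = name .. i
            if should_yield then
                coroutine.yield()
            end
        end
    end
end

local function simulate(should_yield)
    return run_all({
        { name = "sse", fn = make_stream(3, should_yield) },
        { name = "hc",  fn = make_stream(1, true) },  -- e.g. a health check
    })
end

print(simulate(false))  -- sse1 sse2 sse3 hc1
print(simulate(true))   -- sse1 hc1 sse2 sse3
```

Without the yield, the `sse` coroutine drains all of its buffered chunks before `hc` runs at all. With it, `hc` gets its turn after the first chunk.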
   
   ## Cost
   
   - Normal traffic: chunks already arrive with inter-chunk gaps, so 
`body_reader()` already yields naturally between chunks. The extra 
`ngx.sleep(0)` adds negligible overhead.
   - Burst traffic: caps per-coroutine runtime to one chunk's worth of work 
between yields. The yield itself costs on the order of microseconds.
   
   ## Test plan
   
   This is a concurrency / scheduling fix that is difficult to reproduce 
deterministically in test-nginx. Burst behavior depends on TCP buffering 
between the mock upstream and the proxy, and both run in the same nginx 
instance during tests, so timing rarely matches the real-world scenario. 
Existing streaming correctness tests (`t/plugin/ai-proxy*.t`, 
`t/plugin/ai-proxy-client-disconnect.t`) verify that the loop still produces 
correct output and that the new yield doesn't break the disconnect-detection or 
limit-enforcement paths.
   
   Per the project's testing exception for "concurrency issues that are hard to 
simulate", I'm relying on existing tests for correctness regression coverage.
   
   ## Related
   
   This complements:
   
   - #13072 (handle misaligned HTTP chunk and SSE event boundaries) — addressed 
correctness
   - #13226 (abort upstream read on client disconnect) — addressed wasted work 
after disconnect
   
   This PR addresses the remaining failure mode: worker CPU starvation during 
normal (or hallucinating) bursty streams.

