nic-6443 opened a new pull request, #13481:
URL: https://github.com/apache/apisix/pull/13481
### Description
When an AI proxy request times out reaching the upstream LLM, the client
receives `500 Internal Server Error` instead of `504 Gateway Timeout`.
`apisix/plugins/ai-transport/http.lua` maps errors to `504` only when the
error string contains the contiguous substring `timeout`:
```lua
function _M.handle_error(err)
    if core.string.find(err, "timeout") then
        return 504
    end
    return 500
end
```
`core.string.find` is a plain-text search, and an OS connect timeout
surfaces as `Operation timed out` / `Connection timed out`, which contains
`timed out` (with a space), **not** the contiguous `timeout`. So it falls
through to the default `500`.
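The miss can be reproduced outside APISIX with a small sketch. It assumes `core.string.find` behaves like stock `string.find` with the `plain` flag set; `plain_find` and `handle_error_before` are illustrative names, not identifiers from the codebase.

```lua
-- Stand-in for core.string.find (assumption: a plain-text, non-pattern search).
local function plain_find(s, needle)
    return string.find(s, needle, 1, true)
end

-- The pre-fix mapping, copied from the snippet above with the stand-in search.
local function handle_error_before(err)
    if plain_find(err, "timeout") then
        return 504
    end
    return 500
end

print(handle_error_before("timeout"))              -- 504
print(handle_error_before("Operation timed out"))  -- 500
print(handle_error_before("Connection timed out")) -- 500
```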
#### Timeout error taxonomy
`handle_error` receives its error from `lua-resty-http` over OpenResty
cosockets. lua-resty-http propagates the raw cosocket error unchanged (`return
nil, err` in every connect/send/receive/body-reader path), so the only timeout
spellings that can reach `handle_error` are produced by the cosocket layer and
the nginx resolver:
| Scenario | Source | String reaching `handle_error` | matched by |
|---|---|---|:--:|
| connect / send / read / streaming-read deadline (the `set_timeout` value fires) | cosocket timer (`ngx_http_lua_socket_tcp.c`, `FT_TIMEOUT`) | `timeout` | `timeout` |
| kernel connect `ETIMEDOUT`, Linux | errno strerror (lowercased by ngx_lua) | `connection timed out` | `timed out` |
| kernel connect `ETIMEDOUT`, macOS/BSD | errno strerror (lowercased by ngx_lua) | `operation timed out` | `timed out` |
| DNS resolver timeout | `ngx_resolver_strerror` (hardcoded) | `Operation timed out` | `timed out` |
The complete timeout set collapses to exactly two substrings: `timeout`
(cosocket timer) and `timed out` (errno / resolver). Matching both is
**necessary** (the bare `timeout` case has no space; the `… timed out` cases
have no contiguous `timeout`) and **sufficient**. Non-timeout errors
(`connection refused`, `connection reset by peer`, `closed`) keep returning
`500`.
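As a rough check of the necessity and sufficiency claim, the sketch below runs the strings from the table and the non-timeout controls through a two-substring matcher. `plain_find` and `is_timeout` are hypothetical helpers built on stock `string.find` with the `plain` flag, standing in for `core.string.find`.

```lua
-- Plain-text substring test (assumed equivalent to core.string.find).
local function plain_find(s, needle)
    return string.find(s, needle, 1, true) ~= nil
end

-- Hypothetical matcher covering both timeout spellings.
local function is_timeout(err)
    return plain_find(err, "timeout") or plain_find(err, "timed out")
end

-- Necessity: neither spelling contains the other.
assert(not plain_find("timeout", "timed out"))
assert(not plain_find("connection timed out", "timeout"))

-- Sufficiency over the strings listed in the table, plus non-timeout controls.
local cases = {
    { err = "timeout",                  expect = true  },
    { err = "connection timed out",     expect = true  },
    { err = "operation timed out",      expect = true  },
    { err = "Operation timed out",      expect = true  },
    { err = "connection refused",       expect = false },
    { err = "connection reset by peer", expect = false },
    { err = "closed",                   expect = false },
}

for _, c in ipairs(cases) do
    assert(is_timeout(c.err) == c.expect, "unexpected result for: " .. c.err)
end
```

The match is case-sensitive, which is enough here because every listed spelling keeps `timed out` in lowercase even when the leading word is capitalised.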
### Fix
Match `timed out` in addition to `timeout`:
```lua
function _M.handle_error(err)
    if core.string.find(err, "timeout")
       or core.string.find(err, "timed out") then
        return 504
    end
    return 500
end
```
`handle_error` is the single status mapper behind all AI upstream failure
paths (ai-proxy request/metric, ai-providers sidecar request and streaming
read), so all of them now return `504` on timeout.
### Tests
Added two cases to `t/plugin/ai-transport-http.t`:
- a regression test that mocks `connect` returning `Operation timed out` and
asserts the mapped status is `504`;
- a matrix over one representative per timeout class plus non-timeout
controls.
Both fail before the change (the `… timed out` cases return `500`) and pass
after.
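The shape of the regression case can be sketched in plain Lua. This is not the content of `t/plugin/ai-transport-http.t`; `new_mock_client`, `status_for`, and the host name are hypothetical, and `handle_error` here is a local copy of the post-fix mapping using stock `string.find` in place of `core.string.find`.

```lua
-- Local copy of the post-fix mapping (string.find with plain=true is an
-- assumed stand-in for core.string.find).
local function handle_error(err)
    if string.find(err, "timeout", 1, true)
       or string.find(err, "timed out", 1, true) then
        return 504
    end
    return 500
end

-- Hypothetical mock client whose connect always fails with the given error,
-- mirroring the (nil, err) convention of cosocket-style APIs.
local function new_mock_client(connect_err)
    return {
        connect = function(self, host, port)
            return nil, connect_err
        end,
    }
end

-- Hypothetical caller: map a failed connect to an HTTP status.
local function status_for(client)
    local ok, err = client:connect("llm.example.invalid", 443)
    if not ok then
        return handle_error(err)
    end
    return 200
end

assert(status_for(new_mock_client("Operation timed out")) == 504)
assert(status_for(new_mock_client("connection refused")) == 500)
```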
### Checklist
- [x] I have explained the need for this PR and the problem it solves
- [x] I have explained the changes or the new features added to this PR
- [x] I have added tests corresponding to this change
- [x] I have updated the documentation to reflect this change (N/A —
internal error-code mapping, no documented behavior change)
- [x] I have verified that this change is backward compatible (timeouts
previously returned 500; they now return the more accurate 504)