hanzhenfang commented on issue #13474:
URL: https://github.com/apache/apisix/issues/13474#issuecomment-4670722718
Hi, @Baoyuantop I built a smaller mock-only reproduction for the
intermittent `ai-proxy` timeout. It does not call any real LLM provider.
The mock upstream randomly chooses one behavior for each new TCP connection:
proxy to a local HTTPS OpenAI-compatible mock and return HTTP 200, or hang
before TLS response so APISIX times out inside `httpc:connect()`.
The `ai-proxy` route below sets `keepalive: false` intentionally, so every
request opens a fresh upstream TLS connection. This makes the intermittent
connect/TLS-handshake failure reproducible in a short local test. With
keepalive enabled, one successful connection can be reused and hide later
connect attempts.
## Files
### docker-compose.yaml
```yaml
name: issue-13474
services:
apisix:
image: ${APISIX_IMAGE:-apache/apisix:3.16.0-debian}
container_name: issue-13474-apisix
restart: "no"
depends_on:
- llm-random-tls
environment:
APISIX_STAND_ALONE: "true"
volumes:
- ./config.yaml:/usr/local/apisix/conf/config.yaml:ro
- ./apisix.yaml:/usr/local/apisix/conf/apisix.yaml:ro
ports:
- "19080:9080"
- "19090:9090"
llm-random-tls:
image: apache/apisix:3.16.0-debian
container_name: issue-13474-llm-random-tls
restart: "no"
environment:
MOCK_FAIL_RATE: "30"
volumes:
- ./mock-random-tls-nginx.conf:/tmp/mock-random-tls-nginx.conf:ro
command:
- sh
- -c
- >
mkdir -p /tmp/random-tls/logs &&
openssl req -x509 -newkey rsa:2048 -nodes -days 1
-subj '/CN=llm-random-tls'
-keyout /tmp/random-tls/server.key
-out /tmp/random-tls/server.crt &&
openresty -p /tmp/random-tls -c /tmp/mock-random-tls-nginx.conf -g
'daemon off;'
```
### config.yaml
```yaml
apisix:
node_listen:
- 9080
enable_admin: false
nginx_config:
error_log_level: info
worker_processes: 1
deployment:
role: data_plane
role_data_plane:
config_provider: yaml
control:
ip: 0.0.0.0
port: 9090
```
### apisix.yaml
```yaml
routes:
- id: 1
uri: /random-tls/v1/chat/completions
methods:
- POST
plugins:
ai-proxy:
provider: openai-compatible
timeout: 1000
keepalive: false
keepalive_timeout: 60000
keepalive_pool: 30
ssl_verify: false
auth:
header:
Authorization: Bearer mock-random-tls-key
options:
model: random-tls-mock
override:
endpoint: https://llm-random-tls:9444/v1/chat/completions
#END
```
### mock-random-tls-nginx.conf
```nginx
env MOCK_FAIL_RATE;
worker_processes 1;
error_log /dev/stderr info;
pid /tmp/random-tls/nginx.pid;
events {
worker_connections 1024;
}
stream {
server {
listen 9444;
preread_by_lua_block {
local fail_rate = tonumber(os.getenv("MOCK_FAIL_RATE") or "30")
or 30
local remote_port = tonumber(ngx.var.remote_port or "0") or 0
local seed = math.floor(ngx.now() * 1000000) + ngx.worker.pid()
+ remote_port
math.randomseed(seed)
local r = math.random(100)
if r <= fail_rate then
ngx.log(ngx.WARN, "random TLS mock selected hang before TLS
response, r=", r, ", fail_rate=", fail_rate)
ngx.sleep(60)
return ngx.exit(ngx.ERROR)
end
ngx.log(ngx.WARN, "random TLS mock selected normal TLS upstream,
r=", r, ", fail_rate=", fail_rate)
}
proxy_pass 127.0.0.1:9445;
}
}
http {
access_log /dev/stdout;
server {
listen 9445 ssl;
default_type application/json;
ssl_certificate /tmp/random-tls/server.crt;
ssl_certificate_key /tmp/random-tls/server.key;
location = /v1/chat/completions {
content_by_lua_block {
ngx.req.read_body()
local cjson = require("cjson.safe")
ngx.header["Content-Type"] = "application/json"
ngx.header["X-Mock-LLM"] = "random-tls"
local body = {
id = "chatcmpl-random-tls-mock",
object = "chat.completion",
created = ngx.time(),
model = "random-tls-mock",
choices = {
{
index = 0,
message = {
role = "assistant",
content = "mock response from random-tls",
},
finish_reason = "stop",
},
},
usage = {
prompt_tokens = 3,
completion_tokens = 4,
total_tokens = 7,
},
}
ngx.say(cjson.encode(body))
}
}
}
}
```
## Test steps
1. Save the four files above in the same directory.
2. Start APISIX 3.16.0: `docker compose up -d --force-recreate
--remove-orphans && docker compose exec -T apisix apisix version`
3. Send requests through APISIX:
```sh
for i in $(seq 1 40); do
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' -m 5 \
http://127.0.0.1:19080/random-tls/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"ping"}]}'
done
```
4. Check timeout logs: `docker compose logs --since 2m apisix | grep
'openai-base.lua:366: failed to connect to LLM server: timeout'`
## Observed
With APISIX `3.16.0`, a 40-request run produced mixed 200 and 504 responses.
One observed run was `attempts=40 ok=33 timeout_504=7 other=0`.
I also reran the exact comment version with 10 batches of 40 requests. The
total result was `attempts=400 ok=288 timeout_504=112 other=0`, so the observed
reproduction rate was about `28%` with `MOCK_FAIL_RATE: "30"`.
The APISIX log count matched the timeout count:
```text
openai-base.lua:366: failed to connect to LLM server: timeout -> 112
```
The mock upstream log count also matched:
```text
hang before TLS response -> 112
normal TLS upstream -> 288
```
Representative APISIX log:
```text
openai-base.lua:366: failed to connect to LLM server: timeout, request:
"POST /random-tls/v1/chat/completions HTTP/1.1"
```
I also tested the same mock against APISIX `3.14.1`. It also timed out with
the same upstream behavior, but the log line was the old one:
```text
attempts=40 ok=28 timeout_504=12 other=0
openai-base.lua:229: failed to connect to LLM server: timeout -> 12
```
So this reproduces the intermittent 3.16 log line from the issue with a
simulated LLM upstream. It does not prove a 3.16-only regression by itself,
because the same upstream TLS-handshake stall also times out in 3.14.1 with the
old line number.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]