nic-6443 opened a new pull request, #13505:
URL: https://github.com/apache/apisix/pull/13505
### Description
The `ai-proxy-multi` server picker can permanently ignore health check
results on a worker.
The picker lrucache key is `conf_version .. "#" .. sum(checker.status_ver)`.
On the first request after a worker starts serving a route (or after a config
change destroyed the checkers), the per-instance checkers don't exist yet —
`healthcheck_manager` creates them asynchronously ~1s later. So that request
builds and caches a picker **without any health filtering**, under a key ending
in `#0`.
When the checkers are created in this worker while the shared shm already
contains all targets (added earlier by another worker), `add_target` raises no
event, so every checker's `status_ver` stays 0 and the cache key never changes
(`core.lrucache` also resurrects stale entries whose version still matches, so
the TTL never forces a rebuild). The unfiltered picker is then reused
indefinitely and keeps routing a share of requests to instances that the health
checker has long marked unhealthy in shm — until an unrelated status flip or
another config change happens to change the key.
We hit this in production: an LLM instance was decommissioned (connection
refused), active health checks correctly marked it unhealthy in shm, yet
workers whose checkers were created after that transition kept sending ~1/N of
the route's traffic to it for tens of minutes, returning 500s.
The fix encodes each instance checker's existence into the cache key: a
missing checker contributes `x` instead of a number, so the key changes from
e.g. `V#x-x` to `V#0-0` as soon as the checkers exist and the picker is rebuilt
with health filtering. This also makes the key per-instance instead of a lossy
sum (`{1,2}` and `{2,1}` no longer collide).
Note: the regular upstream path does not have this bug —
`balancer.pick_server` only appends `"#" .. status_ver` when the checker
exists, so the no-checker key already differs there. This change aligns
ai-proxy-multi with that semantic.
### Which issue(s) this PR fixes:
N/A
### Checklist
- [x] I have explained the need for this PR and the problem it solves
- [x] I have explained the changes or the new features added to this PR
- [x] I have added tests corresponding to this change
- [ ] I have updated the documentation to reflect this change
- [ ] I have verified that this change is backward compatible
The test reproduces the production sequence in one worker: a shadow checker
seeds the shared shm (all targets exist, the dead one unhealthy) before any
traffic, so the real checkers' `add_target` raises no event; without the fix
the cached unfiltered picker keeps picking the dead instance (request fails
with 500), with the fix it is rebuilt and only the healthy instance is used.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]