nic-6443 opened a new pull request, #13505:
URL: https://github.com/apache/apisix/pull/13505

   ### Description
   
   The `ai-proxy-multi` server picker can permanently ignore health check 
results on a worker.
   
   The picker lrucache key is `conf_version .. "#" .. sum(checker.status_ver)`. 
On the first request after a worker starts serving a route (or after a config 
change destroyed the checkers), the per-instance checkers don't exist yet — 
`healthcheck_manager` creates them asynchronously ~1s later. So that request 
builds and caches a picker **without any health filtering**, under a key ending 
in `#0`.
   
   When the checkers are created in this worker while the shared shm already 
contains all targets (added earlier by another worker), `add_target` raises no 
event, so every checker's `status_ver` stays 0 and the cache key never changes 
(`core.lrucache` also resurrects stale entries whose version still matches, so 
the TTL never forces a rebuild). The unfiltered picker is then reused 
indefinitely and keeps routing a share of requests to instances that the health 
checker has long marked unhealthy in shm — until an unrelated status flip or 
another config change happens to change the key.
   
   We hit this in production: an LLM instance was decommissioned (connection 
refused), active health checks correctly marked it unhealthy in shm, yet 
workers whose checkers were created after that transition kept sending ~1/N of 
the route's traffic to it for tens of minutes, returning 500s.
   
   The fix encodes each instance checker's existence into the cache key: a 
missing checker contributes `x` instead of a number, so the key changes from 
e.g. `V#x-x` to `V#0-0` as soon as the checkers exist and the picker is rebuilt 
with health filtering. This also makes the key per-instance instead of a lossy 
sum (`{1,2}` and `{2,1}` no longer collide).
   
   Note: the regular upstream path does not have this bug — 
`balancer.pick_server` only appends `"#" .. status_ver` when the checker 
exists, so the no-checker key already differs there. This change aligns 
ai-proxy-multi with that semantic.
   
   ### Which issue(s) this PR fixes:
   
   N/A
   
   ### Checklist
   
   - [x] I have explained the need for this PR and the problem it solves
   - [x] I have explained the changes or the new features added to this PR
   - [x] I have added tests corresponding to this change
   - [ ] I have updated the documentation to reflect this change
   - [ ] I have verified that this change is backward compatible
   
   The test reproduces the production sequence in one worker: a shadow checker 
seeds the shared shm (all targets exist, the dead one unhealthy) before any 
traffic, so the real checkers' `add_target` raises no event; without the fix 
the cached unfiltered picker keeps picking the dead instance (request fails 
with 500), with the fix it is rebuilt and only the healthy instance is used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to