AlinsRan opened a new pull request, #13582:
URL: https://github.com/apache/apisix/pull/13582

   ### Description
   
   Fixes a cluster of active-health-check problems that share an apisix-side 
root cause: the health-check manager **destroys and rebuilds** the checker on 
every upstream change, even when only the upstream nodes changed and the 
`checks` config is identical.
   
   #### Root cause
   
   `fetch_checker()` keys the working checker by a version derived from 
**both** `modifiedIndex` and the nodes version (`upstream_utils.version`). So a 
node-only change bumps the version, `fetch_checker()` returns `nil`, and the 
resource is queued for an asynchronous rebuild. During that rebuild window:
   
   1. `api_ctx.up_checker` is `nil`, so the balancer's health filtering is 
bypassed and traffic flows to nodes already known to be **unhealthy** for ~1-2s 
(#13282).
   2. The rebuild discards the checker's accumulated health state and re-probes 
every node from scratch.
   
   `timer_working_pool_check` also destroyed the checker for any version 
mismatch, racing `timer_create_checker` and widening the nil window.
   
   #### Fix
   
   - Added `compute_targets()` / `sync_checker_targets()`: when the `checks` 
config is **unchanged** (compared with `core.table.deep_eq`) and only the nodes 
changed, reconcile the existing checker's targets in place via 
`add_target`/`remove_target` against the authoritative shm target list, keeping 
the checker and its health state alive.
   - `timer_working_pool_check` no longer destroys a checker for a node-only 
version change.
   - When a rebuild is genuinely required (the `checks` config changed, or no 
checker exists), the new checker is created and inserted into the working pool 
**before** the old one is stopped, eliminating the `up_checker == nil` gap.
   - The working-pool entry now stores the `checks` config so changes can be 
detected.
   
   #### Test
   
   `t/node/healthcheck-incremental-update.t`:
   - TEST 1: a node-only change must **not** destroy/rebuild the checker — 
asserts `create new checker` appears once (initial) and `clear checker` 
(delayed_clear) does **not**.
   - TEST 2: a `checks`-config change **must** still rebuild — asserts `clear 
checker` appears.
   
   > NOTE on local verification: I could not run the prove suite in my 
environment because the installed OpenResty (1.21.4.4) lacks the HTTP/3/QUIC 
support that the current `t/APISIX.pm` harness unconditionally requires 
(`listen 1994 quic`, `http3 on`), so nginx refuses to start for *any* test 
here. The diff/reconcile algorithm used by `sync_checker_targets` (add/remove 
against `get_target_list`) was validated separately against the real 
`lua-resty-healthcheck` library. **Please run CI / a maintainer with a 
QUIC-capable OpenResty to confirm the new `.t`.**
   
   #### Cross-PR dependency
   
   This depends on the companion library PR **api7/lua-resty-healthcheck#55** 
(clean every checker each cleanup window + release the periodic lock when 
idle), which fixes the related #13385 / #13141 / #13235 root cause inside the 
library. The rockspec dependency is bumped to `lua-resty-healthcheck-api7 = 
3.3.0-0` as a placeholder; **the library fix is not yet released/tagged**, so 
this PR should not be merged until that library version is published and the 
version here is confirmed.
   
   Relates to #13282, #13385, #13141, #13235.
   
   ### Checklist
   
   - [x] I have explained the need for this PR and the problem it solves
   - [x] I have explained the changes or the new features added to this PR
   - [x] I have added tests corresponding to this change
   - [ ] I have updated the documentation to reflect this change (N/A)
   - [x] I have verified that this change is backward compatible (incremental 
update falls back to full rebuild whenever `checks` changes)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to