AlinsRan opened a new pull request, #13582: URL: https://github.com/apache/apisix/pull/13582
### Description Fixes a cluster of active-health-check problems that share an apisix-side root cause: the health-check manager **destroys and rebuilds** the checker on every upstream change, even when only the upstream nodes changed and the `checks` config is identical. #### Root cause `fetch_checker()` keys the working checker by a version derived from **both** `modifiedIndex` and the nodes version (`upstream_utils.version`). So a node-only change bumps the version, `fetch_checker()` returns `nil`, and the resource is queued for an asynchronous rebuild. During that rebuild window: 1. `api_ctx.up_checker` is `nil`, so the balancer's health filtering is bypassed and traffic flows to nodes already known to be **unhealthy** for ~1-2s (#13282). 2. The rebuild discards the checker's accumulated health state and re-probes every node from scratch. `timer_working_pool_check` also destroyed the checker for any version mismatch, racing `timer_create_checker` and widening the nil window. #### Fix - Added `compute_targets()` / `sync_checker_targets()`: when the `checks` config is **unchanged** (compared with `core.table.deep_eq`) and only the nodes changed, reconcile the existing checker's targets in place via `add_target`/`remove_target` against the authoritative shm target list, keeping the checker and its health state alive. - `timer_working_pool_check` no longer destroys a checker for a node-only version change. - When a rebuild is genuinely required (the `checks` config changed, or no checker exists), the new checker is created and inserted into the working pool **before** the old one is stopped, eliminating the `up_checker == nil` gap. - The working-pool entry now stores the `checks` config so changes can be detected. #### Test `t/node/healthcheck-incremental-update.t`: - TEST 1: a node-only change must **not** destroy/rebuild the checker — asserts `create new checker` appears once (initial) and `clear checker` (delayed_clear) does **not**. - TEST 2: a `checks`-config change **must** still rebuild — asserts `clear checker` appears. > NOTE on local verification: I could not run the prove suite in my environment because the installed OpenResty (1.21.4.4) lacks the HTTP/3/QUIC support that the current `t/APISIX.pm` harness unconditionally requires (`listen 1994 quic`, `http3 on`), so nginx refuses to start for *any* test here. The diff/reconcile algorithm used by `sync_checker_targets` (add/remove against `get_target_list`) was validated separately against the real `lua-resty-healthcheck` library. **Please run CI / a maintainer with a QUIC-capable OpenResty to confirm the new `.t`.** #### Cross-PR dependency This depends on the companion library PR **api7/lua-resty-healthcheck#55** (clean every checker each cleanup window + release the periodic lock when idle), which fixes the related #13385 / #13141 / #13235 root cause inside the library. The rockspec dependency is bumped to `lua-resty-healthcheck-api7 = 3.3.0-0` as a placeholder; **the library fix is not yet released/tagged**, so this PR should not be merged until that library version is published and the version here is confirmed. Relates to #13282, #13385, #13141, #13235. ### Checklist - [x] I have explained the need for this PR and the problem it solves - [x] I have explained the changes or the new features added to this PR - [x] I have added tests corresponding to this change - [ ] I have updated the documentation to reflect this change (N/A) - [x] I have verified that this change is backward compatible (incremental update falls back to full rebuild whenever `checks` changes) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
