[I] Connect Attempts Bugs and Fixes [trafficserver]

via GitHub Sun, 17 May 2026 22:53:26 -0700


masaori335 opened a new issue, #13169:
URL: https://github.com/apache/trafficserver/issues/13169


   # Report of connect attempts bugs
   
   - affected versions: 10.1.x and smaller
   
   Relevant configs:
   
   - `proxy.config.http.connect_attempts_max_retries`
   - `proxy.config.http.connect_attempts_max_retries_suspect_server`
     (replaces the deprecated 
`proxy.config.http.connect_attempts_max_retries_down_server` —
     see PR #13102)
   - `proxy.config.http.connect_attempts_rr_retries`
   - `proxy.config.http.down_server.cache_time` (the `fail_window`)
   
   PRs:
   
   - https://github.com/apache/trafficserver/pull/12846
   - https://github.com/apache/trafficserver/pull/12880
   - https://github.com/apache/trafficserver/pull/13092
   - https://github.com/apache/trafficserver/pull/13102
   
   ## Net effect between 10.1.x and later
   
   On 10.1.x and earlier, the three `connect_attempts_*` configs behaved 
erratically:
   
   - `connect_attempts_max_retries_down_server` was effectively unreachable 
because the `is_down()`
     check was inverted (#12846).
   - Single-address origins ignored down-state entirely — `select_best_http` 
unconditionally returned
     the first (and only) entry (#12880).
   - The "mark host DOWN" threshold was gated by `connect_attempts_rr_retries` 
instead of
     `connect_attempts_max_retries` (#13102).
   - The fail-window clock used `client_request_time` instead of the actual 
failure time, so the window
     could start in the past or be pre-expired (#13102).
   - After `select_next_rr()`, `dst_addr` was not updated, so RR "switches" 
silently kept hitting the
     same target (#13102).
   - `mark_host_failure()` ran at end-of-txn against the post-switch 
`dns_info.active`, so failures on
     A could be charged to B (#13102).
   - RR exhaustion terminated retries early even when the current target still 
had per-host budget
     (#13102).
   
   The combined effect of PRs #12846, #12880, #13092, and #13102 is a coherent 
model in which each
   config has a well-defined role tied to an explicit `UP / DOWN / SUSPECT` 
state for the active
   target.
   
   ### Worked examples
   
   All examples use the ATS defaults:
   
   | Config                                          | Default |
   |-------------------------------------------------|---------|
   | `connect_attempts_max_retries`                  | 3       |
   | `connect_attempts_max_retries_suspect_server`   | 1       |
   | `connect_attempts_rr_retries`                   | 3       |
   | `down_server.cache_time` (`fail_window`)        | 60s     |
   
   #### Example 1 — Single-address origin goes offline
   
   Client sends repeated requests to an origin whose only A record points to a 
down IP. How many
   connect attempts does ATS make before the IP is marked DOWN in HostDB?
   
   **10.1.x behavior — IP marked DOWN after 12 connect attempts (3 failed 
transactions).**
   
   Why: `mark_host_failure()` is only called from 
`do_hostdb_update_if_necessary()`, which is only
   reached at the end of a transaction (via 
`handle_server_connection_not_open`). So `fail_count`
   increments **once per failed transaction**, not once per attempt. The 
threshold is
   `connect_attempts_rr_retries` (default `3`) — the wrong config, but that's 
what 10.1.x uses.
   Per-transaction there are `1 + connect_attempts_max_retries = 4` attempts.
   
   | Attempt | Txn | `fail_count` after | marked DOWN? |
   |---------|-----|--------------------|--------------|
   | 1       | 1   | 0                  | no           |
   | 2       | 1   | 0                  | no           |
   | 3       | 1   | 0                  | no           |
   | 4       | 1   | 1 (end of txn)     | no           |
   | 5       | 2   | 1                  | no           |
   | 6       | 2   | 1                  | no           |
   | 7       | 2   | 1                  | no           |
   | 8       | 2   | 2 (end of txn)     | no           |
   | 9       | 3   | 2                  | no           |
   | 10      | 3   | 2                  | no           |
   | 11      | 3   | 2                  | no           |
   | **12**  | 3   | 3 ≥ 3              | **yes**      |
   
   And because `select_best_http()` returns `info[0]` unconditionally for 
single-address records
   (#12880), every one of those 12 attempts actually hits the network. Even 
after the IP is marked
   DOWN, subsequent requests keep trying it because the single-address code 
path ignores down-state.
   
   **Later behavior — IP marked DOWN after 4 connect attempts (1 failed 
transaction).**
   
   Why: `do_hostdb_update_if_necessary()` is now called after **each** failure 
in
   `handle_response_from_server()`, so `fail_count` increments **once per 
attempt**. The threshold is
   `max_retries + 1 = 4` (for an UP target). Per-transaction there are still `1 
+ max_retries = 4`
   attempts, so the threshold is reached on the last attempt of the first 
failed transaction.
   
   | Attempt | Txn | `fail_count` after | marked DOWN? |
   |---------|-----|--------------------|--------------|
   | 1       | 1   | 1                  | no           |
   | 2       | 1   | 2                  | no           |
   | 3       | 1   | 3                  | no           |
   | **4**   | 1   | 4 ≥ 4              | **yes**      |
   
   Subsequent requests within `fail_window` (60s): `select_best_http()` now 
honors down-state for
   single-address records, so HostDB lookup returns `nullptr` and the SM 
reports `OriginDown`
   immediately — **zero connect attempts**.
   
   After `fail_window` elapses, the entry is SUSPECT. The next request gets
   `connect_attempts_max_retries_suspect_server + 1 = 2` attempts (1 probe + 1 
retry). Success → UP;
   failure → back to DOWN for another 60s.
   
   #### Example 2 — Round-robin with one bad backend (A fails, B/C healthy)
   
   Defaults apply: `rr_retries = 3`, `max_retries = 3`. A is unresponsive, B 
and C are fine. Initial
   active = A.
   
   **10.1.x behavior — client gets 502, and the wrong host is blamed.**
   
   Two compounding bugs: (a) after `select_next_rr()` is called, neither 
`dns_info.addr` nor
   `server_info.dst_addr` is updated, so the next connect still goes to A's 
address. (b)
   `mark_host_failure()` is called once at end-of-txn against 
`dns_info.active`, which by that point is
   the *post-switch* host (B), so B's `fail_count` is incremented for A's 
failures.
   
   | Attempt | dst_addr (target) | dns_info.active | retry_attempts after | 
`(retry+1) % 3 == 0`? | Result |
   
|---------|-------------------|-----------------|----------------------|----------------------|--------|
   | 1       | A                 | A               | 1                    | no  
                 | fail   |
   | 2       | A                 | A               | 2                    | no  
                 | fail   |
   | 3       | A                 | A → switched to B (`select_next_rr`); 
dst_addr untouched | 3 | **yes** | fail |
   | 4       | A (still!)        | B               | 3 ≥ 3, give up       | n/a 
                 | fail → 502 to client |
   
   End of txn: `mark_host_failure()` increments `B.fail_count` even though B 
was never tried. Repeated
   requests do this 3 times → **B gets marked DOWN** while A (the 
actually-broken host) keeps getting
   hit on every transaction.
   
   **Later behavior — client gets 200; A's failures are correctly recorded 
against A.**
   
   `do_hostdb_update_if_necessary()` runs after each failure (so the correct 
`dns_info.active` is in
   scope), and after `select_next_rr()` the SM explicitly assigns 
`dns_info.addr` and
   `server_info.dst_addr` to the new target.
   
   | Attempt | dst_addr (target) | dns_info.active | A.fail_count after | 
retry_attempts after | RR switch? | Result |
   
|---------|-------------------|-----------------|--------------------|----------------------|------------|--------|
   | 1       | A                 | A               | 1                  | 1     
               | no         | fail   |
   | 2       | A                 | A               | 2                  | 2     
               | no         | fail   |
   | 3       | A                 | A               | 3                  | 3     
               | **yes** → active=B, dst_addr=B | fail |
   | **4**   | **B**             | B               | 3 (unchanged)      | n/a   
               | n/a        | **success → 200** |
   
   A finishes this transaction with `fail_count = 3` (not yet at the threshold 
of
   `max_retries + 1 = 4`). The next request that hashes to A bumps it to 4 and 
marks A DOWN; subsequent
   requests within `fail_window` skip A entirely until the window expires.
   
   If B and C had also been DOWN (no selectable alternate), `select_next_rr()` 
returns false and the SM
   stays on A using its remaining per-host retry budget instead of giving up — 
new in #13102.
   
   **Later behavior**
   
   - Failure timestamp is `ts_clock::now()` at the moment `mark_down` is 
called, so the full
     `fail_window` is honored regardless of how long the transaction took to 
decide the host was bad.
   
   ## Summary of bug-fix PRs
   
   ### PR #12846 — Fix `HostDBInfo::is_down` condition
   
   **Bug.** `HostDBInfo::is_down()` had an inverted comparison:
   
   ```cpp
   // Before
   return (last_fail != TS_TIME_ZERO) && (last_fail + fail_window < now);
   // After
   return (last_fail != TS_TIME_ZERO) && (now <= last_fail + fail_window);
   ```
   
   The pre-fix code reported a host as DOWN only *after* `fail_window` had 
elapsed — the opposite of
   intent. During the fail window (when the host should be considered blocked) 
it returned `false`.
   
   **Behavior change.** A failed host is now correctly treated as DOWN for the 
entire `fail_window`
   following `last_failure`, and returns to UP only after the window has 
elapsed. Every downstream path
   that calls `is_down()` (retry budget, RR selection, negative-cached check) 
is impacted.
   
   ### PR #12880 — Check state of HostDBInfo in `select_best_http`
   
   **Bug.** For single-address (non round-robin) records, 
`HostDBRecord::select_best_http()` bypassed
   the state check:
   
   ```cpp
   // Before
   best_alive = &info[0];                       // unconditional
   // After
   if (info[0].select(now, fail_window)) {
     best_alive = &info[0];
   }
   ```
   
   A single-address host that was marked down would still be returned from 
HostDB and reused for the
   next request.
   
   **Behavior change.** Single-address origins now honor `down_server` state 
the same way RR origins
   do: if the only target is DOWN and still within the fail window, HostDB 
lookup fails (`nullptr`),
   the SM reports `OriginDown`, and no connection attempt is made until the 
window expires. The gold
   autest `dns_host_down` was updated accordingly (second request returns 500 
instead of 502 once the
   first failure marked the IP down).
   
   ### PR #13092 — Clarify `HostDBInfo` state
   
   **Not a bug fix — a model/refactor.** Introduces an explicit tri-state for 
upstream health and
   reshapes the API around it:
   
   ```cpp
   enum class HostDBInfo::State { UP, DOWN, SUSPECT };
   ```
   
   - **UP** — no known failure; normal selection.
   - **DOWN** — `_last_failure` set, `now` is within `fail_window`; not 
eligible.
   - **SUSPECT** — `fail_window` has elapsed; a probe is permitted. A 
successful response transitions
     it to UP via `mark_up()`; another failure returns it to DOWN.
   
   API renames: `is_alive` → `is_up`, `mark_active_server_alive` → 
`mark_active_server_up`.
   `mark_down` now takes `fail_window`. `select()` is replaced by callers using 
`is_down()` directly.
   `last_failure` / `fail_count` become private atomics (`_last_failure`, 
`_fail_count`).
   
   **Behavior change.** The previous two-state model collapsed SUSPECT into UP 
once `fail_window`
   expired, which meant a recovering host was treated identically to a 
never-failed host for retry
   sizing. With SUSPECT explicit, callers (notably the retry machinery in 
#13102) can apply
   `connect_attempts_max_retries_suspect_server` to probing traffic and 
`connect_attempts_max_retries`
   to UP traffic.
   
   ### PR #13102 — Fix connect attempt retries
   
   **Bugs.** The retry machinery in 
`HttpTransact::handle_response_from_server()` and
   `HttpSM::mark_host_failure()` had several interlocking issues:
   
   1. **Wrong threshold used to mark host down.** `mark_host_failure()` called
      `increment_fail_count(..., connect_attempts_rr_retries, ...)`. The RR 
switch-over config was
      reused as the "mark down" threshold, so a host was marked DOWN after
      `connect_attempts_rr_retries` failures instead of 
`connect_attempts_max_retries + 1` total
      attempts.
   2. **Wrong clock for failure timestamp.** `do_hostdb_update_if_necessary()` 
recorded the failure at
      `t_state.client_request_time` (request-receipt time), not 
`ts_clock::now()`. For long requests
      the fail window started in the past, sometimes expiring before the 
failure was even recorded.
   3. **RR exhaustion short-circuited all retries.** When the 
`connect_attempts_rr_retries` boundary
      was hit and no other RR member was selectable, the old code gave up, even 
though the current
      target still had per-host retry budget.
   4. **`connect_attempts_max_retries_down_server` had no effect in practice.** 
It was only consulted
      through `is_server_negative_cached()`, which depended on the inverted 
`is_down()` from #12846. In
      many cases it was either never applied or applied when it shouldn't have 
been.
   5. **`connect_attempts_max_retries_down_server == 0` blocked the probe.** A 
separate code path in
      `do_http_server_open()` refused to connect at all when the target was 
negative-cached and the
      config was zero — preventing the SUSPECT-state probe that the fail window 
was designed to
      allow.
   6. **Retry config range allowed overflow.** `[0-255]` with `uint8_t` 
threshold = `max_retries + 1`
      wraps to 0.
   7. **Config name didn't match its actual effect.** The "down_server" config 
never applied to
      DOWN-state origins (DOWN-state retries are now hardcoded to 0); it 
controls the probe budget for
      SUSPECT-state origins. Renamed to 
`proxy.config.http.connect_attempts_max_retries_suspect_server`
      to align with `HostDBInfo::State::SUSPECT`. The previous
      `proxy.config.http.connect_attempts_max_retries_down_server` record is 
kept and marked
      deprecated; if only it is set, its value is mirrored forward into the new 
record with a warning,
      and if both are set the new record wins. The legacy
      `TS_CONFIG_HTTP_CONNECT_ATTEMPTS_MAX_RETRIES_DOWN_SERVER` plugin enum 
continues to work — it
      aliases to the same internal field as the new `_SUSPECT_SERVER` enum.
   
   **Behavior change.**
   
   - Retry budget is now driven by HostDB state (via the new helper
     `HttpTransact::origin_server_connect_attempts_max_retries`):
     - `UP`      → `connect_attempts_max_retries`
     - `DOWN`    → 0 (bail out immediately)
     - `SUSPECT` → `connect_attempts_max_retries_suspect_server`
   - `mark_host_failure` uses `max_retries + 1` as the attempt budget and 
`ts_clock::now()` as the
     failure time.
   - RR switch and per-host retry are separated: if RR has no selectable 
alternate, the SM stays on
     the current target and keeps retrying up to the per-host budget rather 
than giving up.
   - The `do_http_server_open()` early-bail on "negative cached + 
down_server=0" is removed — DOWN vs
     SUSPECT is now the single source of truth.
   - `proxy.config.http.connect_attempts_max_retries`, 
`…_max_retries_suspect_server`, and
     `…_rr_retries` clamped to `[0-254]` to avoid the `uint8_t` overflow when 
adding 1 for the
     total-attempt count.
   - A new warning is logged at config reconfigure when
     `connect_attempts_max_retries_suspect_server == 0` and 
`connect_attempts_rr_retries > 0`, because
     that combination prevents any probe of a recovering (SUSPECT) origin.
   - `proxy.config.http.connect_attempts_max_retries_down_server` is now a 
deprecated alias of
     `…_suspect_server`. The struct field
     `OverridableHttpConfigParams::connect_attempts_max_retries_down_server` 
was removed; both
     records.yaml entries and both `TSOverridableConfigKey` enum values write 
into a single field, so
     existing operator configs and plugins keep working while the codebase has 
one source of truth.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] Connect Attempts Bugs and Fixes [trafficserver]

Reply via email to