bryancall commented on PR #13116:
URL: https://github.com/apache/trafficserver/pull/13116#issuecomment-4479671396

   Posting a summary of the design discussion plus the prod validation we just 
ran on e6.
   
   ## Why the SERVER_READ divert and not a response transform
   
   `TSHttpTxnErrorBodySet()` from a response-stage hook only worked end-to-end 
when the origin returned an error path; against a valid origin response the SM 
took the server-transfer branch and the plugin's body was discarded. The fix in 
this PR is a precise divert in `HttpSM::handle_api_return()` when the plugin 
set an internal body, the origin response is valid, no `plugin_tunnel` is in 
play, and the recorded hook id at `TSHttpTxnErrorBodySet()` time was 
`READ_RESPONSE_HDR` or `SEND_RESPONSE_HDR`.
   
   The natural alternative is to do what `txn_box`'s `upstream-rsp-body` does — 
install a `TS_HTTP_RESPONSE_TRANSFORM_HOOK` and rewrite body bytes inline. We 
considered that and chose the SERVER_READ divert for these reasons.
   
   ### Why the SERVER_READ divert
   
   - **Threat-model alignment.** The motivating issue is SSRF leakage of S3 403 
XML bodies (PSECBUGS-107834 / PSECBUGS-108568). `setup_internal_transfer()` 
releases the server session before the response body is consumed, so the 
sensitive origin body never crosses the cache write boundary or hits disk. A 
transform-based fix would still stream the leaky bytes through ATS and, with 
the default `cache_untransformed` behavior, land them in cache.
   - **Fixes the API, not just one plugin.** Any plugin calling 
`TSHttpTxnErrorBodySet()` from a response hook gets correct behavior, not only 
`header_rewrite`.
   - **Status code and cacheability decisions remain aligned with the origin 
response.**
   - **Content-Length and Content-Type are handled by the existing 
internal-transfer setup.**
   
   ### Trade-offs we accept
   
   - Touches `HttpSM` core, which is what made review careful. The divert is 
gated on five conditions (`internal_msg_buffer`, 
`!api_server_request_body_set`, `server_response.valid()`, `plugin_tunnel == 
nullptr`, captured-hook-id in the response stage) to keep the blast radius 
narrow.
   - `plugin_tunnel == nullptr` is the explicit `TSHttpConnect` intercept 
guard, replacing the indirect `api_hooks.get()` check.
   - The captured-hook-id check (`internal_msg_buffer_set_on`) scopes the 
divert to plugin-driven response replacement.
   
   ### Why not a response transform
   
   - Body bytes pass through ATS before being rewritten; the leaky origin 
payload could be persisted to cache by default. Mitigable, but adds 
configuration that is easy to get wrong.
   - It is a substantially larger plugin rewrite (transform vconn lifecycle, 
chunked-encoding handling, Content-Length recomputation, abort cleanup).
   - It does not address the underlying gap in `TSHttpTxnErrorBodySet()` 
behavior, so the next plugin to hit this learns the same lesson.
   
   ## Production validation
   
   Validated on an internal production edge host serving S3-backed routes 
(Yahoo edge, 10.0.x ATS line with this PR's patches applied; same Apache change 
set as PR #13116). The host has been running this build for 65+ hours with no 
incidents prior to the test.
   
   **Method:** appended three header-gated `header_rewrite` rules to an 
S3-backed remap's per-remap `header_rewrite` config (rules only fire when the 
client sends a magic `X-CDN-Set-Body-Test: 1` header, so real user traffic is 
unaffected), bumped `remap.config` mtime, ran `traffic_ctl config reload`. 
Drove ten curl requests with the magic header against a known S3-backed test 
endpoint (origin returns a 17 KB GIF). Reverted the config and reloaded again.
   
   **Result:**
   
   | Counter | Before | After | Delta | Expected |
   |---|---|---|---|---|
   | `proxy.process.http.origin_body_replaced` | 0 | 10 | +10 | +10 (one per 
curl) |
   | `proxy.process.http.origin_body_replaced_in_transform` | 0 | 0 | 0 | 0 (no 
transform in play) |
   | `proxy.process.http.origin_body_replaced_pre_transform` | 0 | 0 | 0 | 0 |
   | `plugin.header_rewrite ... test_set_body_origin.active_match` | 0 | 10 | 
+10 | +10 |
   | Curl response size | 17104 B (real GIF) | 41 B (replacement) | replaced | 
replaced |
   | Curl HTTP status | 200 | 200 | preserved | preserved |
   
   All ten requests returned HTTP 200 with the replacement body 
(`test-body-from-set-body-origin-experiment`, 41 bytes) instead of the 17 KB 
origin GIF. After reverting the rule and waiting for reload to propagate, the 
same curl returned the real GIF and counters held steady, confirming clean 
teardown with no lingering side effects.
   
   The pre-existing `proxy.process.api.error_body_set_response_hook` counter 
(which counts calls to `TSHttpTxnErrorBodySet()` from any response hook) was 
already ticking at roughly 36 calls/sec on background production traffic from 
another plugin (`txn_box`'s `security.txt` rule); that other path uses a 
response transform and does **not** trigger the new divert. 
`origin_body_replaced` correctly stayed at zero from background traffic and 
went up by exactly ten from our test, confirming the divert is scoped to the 
cases this PR targets.
   
   ## Automated test coverage
   
   - `header_rewrite_set_body_origin` AuTest passes, including the 
`null_transform` interaction case (TRANSFORM_READ branch).
   - Full SDK regression suite passes: `SDK_API_TSHttpConnectIntercept`, 
`SDK_API_TSHttpConnectServerIntercept`, `SDK_API_HttpAltInfo`, 
`SDK_API_HttpTxnTransform`, `SDK_API_HttpTxnCache`, `SDK_API_HttpSsn`, 
`SDK_API_HttpHookAdd`.
   - Exported symbols of `libtsapi` / `libtscore` / `libtsutil` / `libtscppapi` 
are unchanged (0 added, 0 removed).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to