mkrug1981 opened a new issue, #13247:
URL: https://github.com/apache/trafficserver/issues/13247

   ## Bug: slice plugin fatal assertion when `TSHttpTxnEffectiveUrlStringGet` is called with an invalid txn in `read_resp_hdr` (introduced by #11618)
   
   ### Summary
   
   ATS crashes with a fatal assertion when the slice plugin calls 
`TSHttpTxnEffectiveUrlStringGet` inside its server intercept response handler. 
The call site was introduced by the Conditional Slicing feature (#11618). At 
the point `read_resp_hdr` fires inside the intercept PluginVC, the original 
transaction's HTTP state machine has already advanced past the hook phase where 
this API is valid, causing `sdk_sanity_check_txn` to fail and ATS to abort.
   
   This is a **fleet-wide, reproducible crash** observed on two independent 
hosts running the same slice.so build, confirming it is not an isolated 
incident.
   
   ### Version
   
   - **ATS version:** 10.1.3 (RPM build path references 10.1.2 sources — 
possible packaging issue worth investigating separately)
   - **OS:** RHEL 8 / Linux x86_64, kernel 4.18.0-553.x.el8_10
   - **Build date:** 2026-05-26
   
   ### Plugins loaded
   
   `slice.so`, `cache_range_requests.so`, `header_rewrite.so`, `compress.so`, 
`regex_remap.so`, `cachekey.so`, `background_fetch.so`, `maxmind_acl.so`, 
`tslua.so`
   
   ### Steps to reproduce
   
   1. Configure ATS with the `slice` plugin active on remap rules, with or 
without `--minimum-size` (conditional slicing).
   2. Serve traffic that triggers the slice intercept path (range requests or 
slice-eligible objects).
   3. Under production load, ATS fatally aborts.
   
   ### Observed behaviour
   
   Fatal assertion in `src/api/InkAPI.cc:3940`:
   
   ```
   Fatal: /rpmbuilddir/BUILD/trafficserver-10.1.2/src/api/InkAPI.cc:3940:
   failed assertion `TS_SUCCESS == sdk_sanity_check_txn(txnp)`
   ```
   
   Crash stack (from `traffic_crashlog`), identical across both affected hosts:
   
   ```
   Thread [ET_NET N]:
     crash_logger_invoke
     gsignal / abort
     ink_abort
     _ink_assert
     _TSReleaseAssert
     TSHttpTxnEffectiveUrlStringGet(txnp, &urllen)   ← assertion fires here
     handle_server_resp(tsapi_cont*, TSEvent, Data*)  [slice.so  +0x4eb]
     intercept_hook(tsapi_cont*, TSEvent, void*)      [slice.so  +0x41d]
     INKContInternal::handle_event
     PluginVC::process_write_side
     PluginVC::main_handler
   ```
   
   Identical function offsets (`handle_server_resp +0x4eb`, `intercept_hook 
+0x41d`) on both hosts confirm the same slice.so binary is affected fleet-wide.
   
   ### Root cause
   
   PR #11618 (Conditional Slicing) added a new `read_resp_hdr` function in 
`plugins/slice/server.cc` that calls 
`TSHttpTxnEffectiveUrlStringGet(txnp, &urllen)` to look up the effective URL 
for updating the object size cache. This function is registered on 
`TS_HTTP_READ_RESPONSE_HDR_HOOK` and also fires from within the intercept 
handler chain via `handle_server_resp`.
   
   When slice sets up a server intercept (the continuation is created with 
`TSContCreate(intercept_hook, mutex)`), the `PluginVC` drives a fake origin 
exchange. When `handle_server_resp` fires inside that intercept, the `txnp` is 
the intercept pseudo-transaction. The original HTTP SM-backed transaction has 
already advanced past the `TS_HTTP_READ_REQUEST_HDR_HOOK` state, so 
`sdk_sanity_check_txn`, which validates that the pointer refers to a usable 
`HttpSM`, returns `TS_ERROR`, triggering the release assert.
   
   This is a fundamental API lifecycle violation: 
`TSHttpTxnEffectiveUrlStringGet` is only valid while the original transaction 
is in a hook callback, not from inside an intercept server response handler.
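
To make the failure mode concrete, here is a simplified stand-in model rather than ATS code. `FakeTxn`, `fake_sanity_check`, `fake_effective_url`, and `late_lookup_fails` are hypothetical names, and the model assumes only that the API rejects a handle that fails its sanity check. It throws where the real API would abort, so the behaviour can be exercised without crashing.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a transaction handle; not an ATS type.
struct FakeTxn {
  bool        valid; // models whatever the real sanity check verifies
  std::string url;
};

// Models the sanity check: succeeds only for a usable handle.
inline bool fake_sanity_check(const FakeTxn *txn) {
  return txn != nullptr && txn->valid;
}

// Models an API that asserts on an invalid handle. The real API aborts the
// process; throwing keeps this sketch testable.
inline std::string fake_effective_url(const FakeTxn *txn) {
  if (!fake_sanity_check(txn)) {
    throw std::logic_error("sanity check failed");
  }
  return txn->url;
}

// Models the call made from the intercept response handler: returns true
// when the lookup is rejected.
inline bool late_lookup_fails(const FakeTxn *txn) {
  try {
    fake_effective_url(txn);
    return false;
  } catch (const std::logic_error &) {
    return true;
  }
}
```

In this model the outcome depends only on whether the handle is still usable when the call is made, which is why capturing the URL earlier sidesteps the assertion.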
   
   ### Suggested fix
   
   The effective URL should be captured once in `read_request` (where `txnp` is 
fully valid) and stored in the `Data` struct, then read from there in 
`read_resp_hdr` instead of calling the API again.
   
   **1. Add a field to the `Data` struct:**
   
   ```cpp
   struct Data {
     // ... existing fields ...
     // captured at read_request time, before data.release()
     std::string effective_url;
   };
   ```
   
   **2. Capture in `read_request` before `data.release()`:**
   
   ```cpp
   int   urllen = 0;
   char *urlstr = TSHttpTxnEffectiveUrlStringGet(txnp, &urllen);
   if (urlstr != nullptr) {
     data->effective_url.assign(urlstr, urllen);
     TSfree(urlstr);
   }
   // ... then data.release() as before
   ```
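
The capture step can also be written so that the API buffer is freed on every path. This is a minimal sketch under stated assumptions: `stub_url_get` and `stub_free` are stand-ins for `TSHttpTxnEffectiveUrlStringGet` and `TSfree` so that it builds without the ATS headers, and `StubFree` and `capture_url` are hypothetical helpers, not part of the plugin.

```cpp
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string>

// Stand-in for TSHttpTxnEffectiveUrlStringGet: returns a heap buffer the
// caller must free, and writes its length. Returns nullptr on failure.
inline char *stub_url_get(const char *src, int *length) {
  *length = 0;
  if (src == nullptr) {
    return nullptr;
  }
  const std::size_t n   = std::strlen(src);
  char             *buf = static_cast<char *>(std::malloc(n + 1));
  if (buf == nullptr) {
    return nullptr;
  }
  std::memcpy(buf, src, n + 1);
  *length = static_cast<int>(n);
  return buf;
}

// Stand-in for TSfree.
inline void stub_free(void *p) { std::free(p); }

// Deleter so std::unique_ptr releases the buffer through the matching free.
struct StubFree {
  void operator()(char *p) const { stub_free(p); }
};

// Hypothetical helper: copy the URL into an owned std::string. The buffer is
// released when `raw` goes out of scope, including on the early return.
inline std::string capture_url(const char *txn_url) {
  int len = 0;
  std::unique_ptr<char, StubFree> raw(stub_url_get(txn_url, &len));
  if (!raw || len <= 0) {
    return {};
  }
  return std::string(raw.get(), static_cast<std::size_t>(len));
}
```

An empty result then doubles as the "no URL available" signal that the later steps check for.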
   
   **3. Use the stored URL in `read_resp_hdr` (intercept path):**
   
   ```cpp
   // Remove: char *urlstr = TSHttpTxnEffectiveUrlStringGet(txnp, &urllen);
   // Replace with:
   Data *data = static_cast<Data *>(TSContDataGet(contp));
   if (data == nullptr || data->effective_url.empty()) {
     // no URL available: update stats, reenable, return
   }
   // View into the stored string; valid only while *data is alive.
   std::string_view url = data->effective_url;
   ```
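
The early-return behaviour in step 3 can be sketched in isolation. This is an illustration under assumptions, not the plugin's code: `Data` is trimmed to the one new field, and `SizeCache` and `record_object_size` are hypothetical stand-ins for the object size cache and its update call.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>

// Trimmed, hypothetical version of the plugin's Data struct.
struct Data {
  std::string effective_url; // captured earlier, while txnp was valid
};

// Hypothetical stand-in for the object size cache.
using SizeCache = std::unordered_map<std::string, std::int64_t>;

// Returns true if the cache was updated, false if the update was skipped.
inline bool record_object_size(SizeCache &cache, const Data *data,
                               std::int64_t content_length) {
  if (data == nullptr || data->effective_url.empty()) {
    // Nothing was captured: skip the update instead of calling the txn API.
    return false;
  }
  // The view borrows from *data, so it must not outlive it.
  std::string_view url = data->effective_url;
  cache[std::string(url)] = content_length;
  return true;
}
```

Skipping the update when no URL was captured trades a missed cache entry for never touching the transaction handle in the response path.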
   
   **4. Non-intercept path (first-time large object discovery):**
   
   For the non-intercepted txn hook, allocate a small struct at hook 
registration time in `read_request` (while `txnp` is still valid), capture the 
URL there, and pass it as cont data. Destroy it in the response handler after 
use.
   
   ```cpp
   struct RespHdrData {
     Config     *config;
     std::string effective_url;
   };
   
   // In read_request, non-intercept branch:
   auto *rhdata = new RespHdrData{config, {}};
   int   urllen = 0;
   char *urlstr = TSHttpTxnEffectiveUrlStringGet(txnp, &urllen);
   if (urlstr != nullptr) {
     rhdata->effective_url.assign(urlstr, urllen);
     TSfree(urlstr);
   }
   TSCont resp_contp = TSContCreate(read_resp_hdr, nullptr);
   TSContDataSet(resp_contp, rhdata);
   TSHttpTxnHookAdd(txnp, TS_HTTP_READ_RESPONSE_HDR_HOOK, resp_contp);
   // TSContDestroy + delete rhdata after use in the handler
   ```
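
One lifetime detail may be worth handling in step 4: the read-response-header hook does not fire for every transaction (for example, when the response is served from cache), so freeing `rhdata` only there could leak it. The sketch below shows one possible ownership scheme, assuming the same continuation is also registered on `TS_HTTP_TXN_CLOSE_HOOK`. `StubEvent`, `StubCont`, and `resp_hdr_handler` are hypothetical stand-ins for the ATS event, continuation, and handler types so that it compiles on its own.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-ins for the two ATS events the handler would see.
enum class StubEvent { ReadResponseHdr, TxnClose };

struct RespHdrData {
  std::string effective_url;
};

// Hypothetical stand-in for a TSCont carrying cont data.
struct StubCont {
  void *data      = nullptr;
  bool  destroyed = false; // models TSContDestroy having been called
};

// Handler registered for both events. Returns the URL it used, if any.
inline std::string resp_hdr_handler(StubCont &cont, StubEvent event) {
  auto *rhdata = static_cast<RespHdrData *>(cont.data);
  if (event == StubEvent::ReadResponseHdr) {
    // Use the stored URL; ownership stays with the continuation.
    return rhdata != nullptr ? rhdata->effective_url : std::string{};
  }
  // TxnClose: reclaim ownership so the data is freed exactly once, whether
  // or not the response-header event ever arrived.
  std::unique_ptr<RespHdrData> owner(rhdata);
  cont.data      = nullptr;
  cont.destroyed = true;
  return {};
}
```

Freeing only at transaction close keeps a single cleanup path, at the cost of holding the small struct slightly longer than strictly needed.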
   
   ### Expected behaviour
   
   The slice plugin should not call `TSHttpTxnEffectiveUrlStringGet` from 
within an intercept server response handler. The URL should be captured earlier 
in the transaction lifecycle and stored for later use.
   
   ### Workaround
   
   Disabling the `slice` plugin stops the crashes. No other workaround is known.
   
   ### Additional context
   
   - Crash observed on two independent production hosts running an identical 
`slice.so` binary (shown by matching intra-plugin offsets despite different 
ASLR load addresses and plugin sandbox UUIDs).
   - The version string mismatch (binary reports 10.1.3, RPM build path shows 
10.1.2) may indicate a separate packaging issue.
   - Signal information and CPU registers are not present in the crashlog 
because the crash is triggered by `abort()` rather than a hardware fault — this 
is expected behaviour.
   - 74 ET_NET threads were active at crash time, indicating a high-concurrency 
production environment.

