Some attempt at bot-backed analysis courtesy of pi/gpt-5.5

----

## Executive summary

`radosgw` on Ubuntu Resolute / Ceph Tentacle crashes during
Keystone-backed S3 authentication because Boost.Asio's
`call_stack<thread_context, thread_info_base>` state is left partially
initialized when DSOs built with different Asio TLS configurations are
loaded into the same process.

The immediate crash is:

```text
RGW http_manager thread
  -> ceph::async::detail::CompletionImpl<...>::destroy_post()
  -> Boost.Asio small-block recycling path
  -> pthread_getspecific(0)
  -> returns GnuTLS RNG/crypto state beginning "expand 32-byte k"
  -> Asio treats bytes at +8 ("2-byte k") as thread_info_base*
  -> SIGSEGV
```

The root cause is not GnuTLS. GnuTLS legitimately owns pthread key `0`.
The bug is that Ceph/RGW's Boost.Asio `thread_context` `top_` object is
marked initialized while its pthread key field remains zero.

Tracer v3 identifies the missing writer: `libboost_process.so.1.90.0`
exports an unversioned Boost.Asio guard symbol for
`call_stack<thread_context, thread_info_base>::top_`, but it does not
export or initialize the matching pthread-TSS `top_` key object. Its GOT
relocation for the guard resolves to `radosgw`'s GNU-unique guard. It
therefore sets `radosgw`'s guard to `1` before Ceph's pthread-TSS
constructor runs, causing Ceph to skip `pthread_key_create()` and leave
`radosgw`'s `top_.tss_key_ == 0`.

This is a mixed Boost.Asio TLS-model / DSO symbol-preemption bug:

```text
Ceph/radosgw:      BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION => pthread TSS
libboost_process:  distro/default Boost.Asio compiler TLS behavior
```

## Affected / observed versions

Confirmed affected:

```text
Ubuntu resolute / 26.04
Ceph Tentacle radosgw 20.2.0-0ubuntu2
Ceph Tentacle PPA radosgw 20.2.1-0ubuntu1~bpo26.04.1~ppa202605042247
Ceph Tentacle PPA radosgw 20.2.1-0ubuntu1~bpo26.04.1~ppa202605272015
```

Vulnerable PPA `/usr/bin/radosgw` identity used for v3 proof:

```text
SHA256: 122a3f8640fed3d75d88d80d0b33676e9f1ae338f1d92400b04958ff1a7fd3b7
```

Validated compiler-TLS candidate `/usr/bin/radosgw`:

```text
SHA256: 63c88ad26eae42d4ee9793b8e887ee9ede5c107469aa011fa3679aa429f85aa1
Build ID: 46acf7cd0a5615bd1d1eae805a75377c61cf8538
```

## Background: why Ceph disabled Asio compiler TLS

Ceph upstream commit:

```text
29ee772263c7ab3c3bf33038bd989336ae3064ad
librados: workaround for boost::asio use of static member variables
```

added global:

```text
-DBOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION
```

and exported selected Asio `call_stack<...>::top_`/guard symbols from
`src/librados/librados.map`.

The commit addressed a real historical problem: Boost.Asio is
header-only and uses static member variables for thread-local
call-stack state. If Asio appears in multiple DSOs, such as `librados`
and `librbd`, each DSO can get separate static state unless the dynamic
linker coalesces the symbols correctly. The Ceph workaround forced Asio
to use pthread TSS and manually exported selected static variables so
the loader could consolidate them.

The current RGW crash shows that workaround is no longer safe as a
blanket process-wide assumption when the same process also loads distro
Boost libraries built with the default Asio compiler-TLS configuration.

## Root cause details

### Asio TLS model switch

Boost.Asio chooses its thread-specific pointer implementation
approximately like this:

```cpp
#if defined(BOOST_ASIO_HAS_THREAD_KEYWORD_EXTENSION)
  keyword_tss_ptr<T>   // compiler TLS, e.g. __thread/thread_local storage
#elif defined(BOOST_ASIO_HAS_PTHREADS)
  posix_tss_ptr<T>     // pthread_key_create/get/set
#endif
```

`BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION` disables compiler TLS and
forces the pthread path.

In Ceph's forced pthread mode, `call_stack<thread_context,
thread_info_base>::top_` is a static object containing a
`pthread_key_t`. In compiler-TLS mode, the state is represented by TLS
symbols such as:

```text
boost::asio::detail::keyword_tss_ptr<...>::value_
```
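To make the difference between the two storage models concrete, here is a minimal sketch of what each one boils down to. The class names `keyword_tss_sketch` and `posix_tss_sketch` are hypothetical stand-ins written for this note, not Boost.Asio's real classes. The point is only that the compiler-TLS variant needs no runtime key, while the pthread variant is unusable until its constructor has actually run `pthread_key_create()`.

```cpp
#include <pthread.h>
#include <stdexcept>

// Hypothetical stand-in for the compiler-TLS model: the per-thread pointer
// is a thread_local static, so no pthread key is ever allocated.
template <typename T>
class keyword_tss_sketch {
public:
  T* get() const { return value_; }
  void set(T* v) { value_ = v; }

private:
  static thread_local T* value_;
};

template <typename T>
thread_local T* keyword_tss_sketch<T>::value_ = nullptr;

// Hypothetical stand-in for the pthread-TSS model: the object owns a
// pthread_key_t that is only valid once the constructor has run.
template <typename T>
class posix_tss_sketch {
public:
  posix_tss_sketch() {
    if (pthread_key_create(&key_, nullptr) != 0)
      throw std::runtime_error("pthread_key_create failed");
  }
  ~posix_tss_sketch() { pthread_key_delete(key_); }
  posix_tss_sketch(const posix_tss_sketch&) = delete;
  posix_tss_sketch& operator=(const posix_tss_sketch&) = delete;

  T* get() const { return static_cast<T*>(pthread_getspecific(key_)); }
  void set(T* v) { pthread_setspecific(key_, v); }

private:
  pthread_key_t key_;  // meaningless if the constructor is skipped
};
```

If the constructor of the pthread variant is skipped, `key_` keeps whatever value its storage had, which for zero-initialized static storage is `0`.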

### Proven failure sequence

1. `libgnutls.so.30` creates pthread key `0` during process startup.
2. `libboost_process.so.1.90.0` static initialization runs.
3. `libboost_process` has an unversioned Asio guard relocation for
   `call_stack<thread_context, thread_info_base>::top_`.
4. The relocation resolves to `radosgw`'s GNU-unique guard address.
5. `libboost_process` writes the guard byte to `1` but has no
   corresponding pthread-TSS `top_` key object and does not call
   `pthread_key_create()` for it.
6. Ceph/radosgw constructors later observe the guard as already
   initialized and skip `posix_tss_ptr_create()` for `radosgw`'s
   `thread_context` `top_`.
7. `radosgw`'s `thread_context` `top_` key remains zero.
8. Runtime Asio code calls `pthread_getspecific(0)` and reads GnuTLS
   state as if it were Asio call-stack state.
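
The sequence above can be modelled in a few lines of single-process code. This is a sketch under stated assumptions, not the actual Ceph or Boost code: every name in it is hypothetical, the guard is a plain local byte rather than a GNU-unique symbol, and it assumes Linux/glibc, where `pthread_key_t` is an integer type and the first key created in a fresh process is typically `0`.

```cpp
#include <pthread.h>

// All names here are hypothetical; they model the roles in the sequence
// above, not real Ceph, Boost, or GnuTLS symbols.
struct split_state_report {
  pthread_key_t foreign_key;  // key owned by the unrelated library
  pthread_key_t asio_key;     // stays zero in the failing case
  bool guard_set;             // shared guard byte after both initializers
  bool key_created;           // whether a key was created for asio_key
  void* foreign_state;        // what the unrelated library stored
  void* read_back;            // what a lookup through asio_key returned
};

inline split_state_report run_split_state_demo() {
  static char foreign_blob[] = "expand 32-byte k";  // stands in for RNG state

  split_state_report r{};   // asio_key starts at 0, like zeroed static storage
  unsigned char guard = 0;  // one guard byte shared by both initializers

  // Step 1: an unrelated library allocates a key first and stores state.
  pthread_key_create(&r.foreign_key, nullptr);
  pthread_setspecific(r.foreign_key, foreign_blob);
  r.foreign_state = foreign_blob;

  // Steps 2-5: a guard-only initializer marks the object as constructed
  // but never creates a key for it.
  if (guard == 0) guard = 1;

  // Steps 6-7: the real initializer sees the guard and skips key creation.
  if (guard == 0) {
    r.key_created = (pthread_key_create(&r.asio_key, nullptr) == 0);
    guard = 1;
  }
  r.guard_set = (guard == 1);

  // Step 8: a lookup through the never-created key. The lookup is only
  // performed when the values collide, to avoid using an unallocated key.
  if (r.asio_key == r.foreign_key)
    r.read_back = pthread_getspecific(r.asio_key);

  pthread_setspecific(r.foreign_key, nullptr);
  pthread_key_delete(r.foreign_key);
  return r;
}
```

Calling `run_split_state_demo()` always reports `guard_set == true` and `key_created == false`. Whenever the foreign key happens to be `0`, `read_back` equals `foreign_state`, which mirrors Asio reading GnuTLS state through key `0`.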

## Key evidence

Artifact root:

```text
/home/ubuntu/rgw-s3-crash-bug/artifacts/20260529T_asio_guard_tss_v3
```

Important files:

```text
RESULT.md
unit-logs/asio_guard_tls_tracer_v3.after-crash.log.gz
unit-logs/asio_guard_tls_tracer_v3.after-crash.summary.txt
unit-logs/live-top-inspection-v3.txt
unit-logs/v3-loaded-asio-symbols.txt
unit-logs/libboost_process-asio-thread-analysis.txt
unit-logs/live-libboost-process-got-inspection-v3.txt
static-analysis/static-init-and-relocation-report.md
reproducers/ceph-like-reproducer-summary.md
```

Tracer v3 key creation evidence:

```text
GnuTLS:          key_ptr=/usr/lib/.../libgnutls.so.30     key=0
libceph-common: key_ptr=/usr/bin/radosgw:...strand...top_ key=1
radosgw:        key_ptr=/usr/bin/radosgw:...await...top_  key=3
```

No `pthread_key_create()` was seen for `radosgw`'s
`thread_context/thread_info_base` `top_`.

Crash-path evidence:

```text
op=key_get key=0 ... 
caller=/usr/bin/radosgw:CompletionImpl<...>::destroy_post(...)+0x578
bytes16=657870616e642033322d62797465206b ascii=expand 32-byte k
```
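
As a cross-check of the `bytes16` field, the hex can be decoded and the second 8 bytes reinterpreted the way x86-64 code would load a pointer. The helper names below are hypothetical and the canonical-address test assumes 48-bit virtual addressing. The decoded string is the well-known Salsa20/ChaCha constant, which fits the RNG/crypto attribution.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode a hex string such as the tracer's bytes16 field into raw bytes.
inline std::string decode_hex(const std::string& hex) {
  auto nibble = [](char c) -> int {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
  };
  std::string out;
  for (std::size_t i = 0; i + 1 < hex.size(); i += 2) {
    const int hi = nibble(hex[i]);
    const int lo = nibble(hex[i + 1]);
    if (hi < 0 || lo < 0) return std::string();  // reject malformed input
    out.push_back(static_cast<char>(hi * 16 + lo));
  }
  return out;
}

// Read up to 8 bytes at `offset` as a little-endian 64-bit value, which is
// how x86-64 code would load a pointer from that location.
inline std::uint64_t load_le64(const std::string& bytes, std::size_t offset) {
  std::uint64_t v = 0;
  for (std::size_t i = 0; i < 8 && offset + i < bytes.size(); ++i) {
    const auto b = static_cast<unsigned char>(bytes[offset + i]);
    v |= static_cast<std::uint64_t>(b) << (8 * i);
  }
  return v;
}

// True if the value is a canonical address under 48-bit addressing:
// bits 63..47 must be all zeros or all ones.
inline bool is_canonical_x86_64(std::uint64_t addr) {
  const std::uint64_t top = addr >> 47;
  return top == 0 || top == 0x1FFFF;
}
```

For the logged value, `decode_hex` yields `expand 32-byte k`, and `load_le64(bytes, 8)` yields `0x6b20657479622d32`, the bytes of `2-byte k`. That value is not a canonical address, so any dereference of it as a `thread_info_base*` would be expected to fault, consistent with the observed `SIGSEGV`.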

Live memory after vulnerable restart:

```text
radosgw thread guard: 0x1
radosgw thread top_: 0x0
radosgw strand top_: key 1
radosgw await top_:  key 3
```

`libboost_process` evidence:

```text
/usr/lib/x86_64-linux-gnu/libboost_process.so.1.90.0
  exports guard variable for
    boost::asio::detail::call_stack<...thread_context...>::top_
  has R_X86_64_GLOB_DAT relocation for that guard
  does not export/create the matching pthread-TSS top_ key object
```

Disassembly shows guard-only initialization:

```asm
5fd2: mov GOT(guard for call_stack<thread_context,...>::top_), %r14
5fd9: cmpb $0x0,(%r14)
5fdf: movb $0x1,(%r14)   # if zero
```

Live GOT inspection proves the preemption target:

```text
libboost_process GOT_guard_thread_top -> 0x5f92865c5240
radosgw_thread_guard                  = 0x5f92865c5240 data=01...
libboost_process own_guard symbol     data=00...
radosgw_thread_top                    data=00...
```

## Validated mitigations / candidates

### 1. Preferred root-cause fix: use compiler TLS consistently in Ceph

Change:

```text
stop defining BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION globally
```

Effect:

- Ceph uses the same Asio compiler-TLS path as distro Boost libraries.
- `radosgw`, `libceph-common`, and `librados` expose/use
  `keyword_tss_ptr<...>::value_` TLS symbols rather than pthread-key
  objects for this state.
- The vulnerable `pthread_getspecific(0)` path for Asio `thread_context`
  is eliminated.

Validation so far:

```text
25/25 Keystone-backed S3 attempts passed
5/5 wrapper attempts passed
post-v3 restore check: 3/3 attempts passed
service active, NRestarts=0
```

Caveat:

This could reintroduce the issue that commit `29ee772...` worked
around: split Boost.Asio static/TLS state across `librados`, `librbd`,
and other DSOs. A package-quality fix must audit and test this, not
just RGW.

Recommended package-quality follow-up:

- build full Debian packages, not only focused binaries;
- verify compile commands contain neither
  `BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION` nor
  `BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING` unless deliberately scoped;
- audit symbols and relocations for `keyword_tss_ptr<...>::value_` and
  related guards across `radosgw`, `libceph-common`, `librados`,
  `librbd`, and any other Ceph Asio users;
- test librados/librbd Asio use cases specifically, including
  in-process multi-DSO scenarios;
- consider updating symbol maps if compiler-TLS Asio symbols must be
  exported/coalesced for the original librados/librbd use case.
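
The compile-command audit in the list above could be automated with something as small as the following sketch. The function name `find_asio_macros` is hypothetical, and the matching is deliberately simple: it only recognizes space-separated `-DNAME` and `-DNAME=value` tokens, so quoted arguments, tabs, or the `-D NAME` form would need extra handling.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return which of the audited Asio macros are defined via -D in a single
// compile command line. Only "-DNAME" and "-DNAME=value" tokens match.
inline std::vector<std::string> find_asio_macros(const std::string& command) {
  static const char* const audited[] = {
      "BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION",
      "BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING",
  };
  std::vector<std::string> found;
  for (const char* name : audited) {
    const std::string flag = std::string("-D") + name;
    std::size_t pos = 0;
    while ((pos = command.find(flag, pos)) != std::string::npos) {
      const std::size_t end = pos + flag.size();
      const bool starts_token = (pos == 0) || command[pos - 1] == ' ';
      const bool ends_token = end == command.size() ||
                              command[end] == ' ' || command[end] == '=';
      if (starts_token && ends_token) {
        found.push_back(name);
        break;  // one hit per macro is enough for the audit
      }
      pos = end;
    }
  }
  return found;
}
```

Running it over each `command` entry of a build's compilation database would give a per-translation-unit list of where either macro is still set.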

### 2. Targeted containment: rebuild/vendor Boost.Process with Ceph's Asio pthread-TSS macro

Idea:

Build a private `libboost_process.so.1.90.0` with:

```text
-DBOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION
```

and ensure `radosgw` loads it instead of the distro `libboost_process`.

Why it may help:

- It aligns Boost.Process with Ceph's pthread-TSS Asio mode.
- The guard-only writer should become a real pthread-TSS `top_`
  initializer that calls `pthread_key_create()`.
- It may preserve the intent of Ceph's old `librados` workaround while
  avoiding this specific mixed-mode crash.

Required validation:

- confirm `radosgw` actually loads the vendored Boost.Process;
- confirm no second system Boost.Process copy is loaded;
- inspect symbols: both guard and matching pthread-TSS `top_` state
  should be present;
- rerun tracer v3: `radosgw` thread guard must not be `1` while key remains `0`;
- rerun Keystone-backed S3 reproducer and broader RGW tests.

Pros:

- likely targeted fix for this exact `libboost_process`
  guard-preemption problem;
- avoids immediately changing Ceph's global Asio TLS model.

Cons:

- vendoring distro Boost shared libraries is high-maintenance;
- security updates and ABI compatibility become harder;
- only addresses `libboost_process`; any other Asio-using distro DSO
  with similar symbol behavior could still trigger a related problem;
- less attractive for Ubuntu/Debian packaging and upstream Ceph.

Status: plausible but not yet built/tested in this investigation.

### 3. Symbol isolation / visibility hardening for Ceph Asio detail symbols

Idea:

Prevent external distro Boost DSOs from partially preempting Ceph/RGW's
Asio detail symbols, especially guard-only preemption.

Possible forms:

- hide Ceph's Boost.Asio detail symbols from the dynamic symbol table
  where safe;
- use version scripts to keep guard/top pairs local or consistently
  versioned;
- namespace or otherwise isolate Ceph's bundled/header-only Asio detail
  instantiations;
- ensure guard and top objects cannot be split across different TLS
  implementations.

Pros:

- attacks the symbol-preemption class directly;
- may preserve Ceph's pthread-TSS workaround without vendoring Boost libraries.

Cons:

- high risk: the original `librados` workaround intentionally exported
  some Asio symbols for cross-DSO coalescing;
- hiding them naively can reintroduce librados/librbd split-state bugs;
- requires careful ELF/version-script design and broader Ceph
  regression testing.

Status: conceptually valid, but riskier than aligning on compiler TLS.

### 4. RGW/radosgw-scoped `BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING`

Idea:

Build RGW/radosgw with:

```text
BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING
```

Effect:

- avoids the specific Asio small-block recycling deallocation path that
  dereferences the corrupted `thread_info_base` pointer;
- does not fix the underlying `guard=1, key=0` state.
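
Why disabling recycling sidesteps the crash without repairing the state can be illustrated with a deliberately simplified model. The types and function below are hypothetical and much smaller than Asio's real allocator; the sketch only assumes what is stated above, namely that the recycling path consults a per-thread info pointer and the non-recycling path does not.

```cpp
#include <cstdlib>

// Hypothetical, simplified per-thread cache with a single reusable slot.
struct thread_info_sketch {
  void* reusable = nullptr;
};

// Returns true if the block was cached in the per-thread slot, false if it
// was freed directly. With recycling disabled the thread-info pointer is
// never dereferenced, so a bogus pointer cannot fault on this path.
inline bool deallocate_sketch(thread_info_sketch* info, void* block,
                              bool recycling_enabled) {
  if (recycling_enabled && info != nullptr && info->reusable == nullptr) {
    info->reusable = block;  // dereferences info: faults if info is garbage
    return true;
  }
  std::free(block);
  return false;
}
```

In this model a garbage `info` pointer is harmless only while `recycling_enabled` is false. The pointer itself is still wrong, which is why this option is containment rather than a fix.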

Validation so far:

```text
fresh model baseline crashed
RGW/radosgw-scoped no-recycling candidate passed 25/25 Keystone-backed S3 attempts
```

Pros:

- proven containment for the observed crash path;
- smaller behavioral change than a global Asio TLS-model change;
- potentially SRU-friendly as an emergency workaround.

Cons:

- not root cause;
- leaves invalid Asio state in the process;
- another Asio code path could still read the bad `thread_context` state later;
- may affect allocation performance.

Status: validated fallback containment, not preferred final fix.


### Root-cause package fix

Preferred path:

1. Produce a full Debian Ceph build that removes global
   `BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION`.
2. Do not include the RGW no-recycling workaround in the root-fix build
   unless explicitly choosing fallback containment.
3. Audit Asio symbols/relocations across all Ceph DSOs, especially
   `librados` and `librbd`, to ensure the old multi-DSO issue is not
   reintroduced.
4. Add/execute librados/librbd Asio regression tests in addition to RGW
   Keystone-backed S3 tests.
5. Rerun RGW Keystone S3 reproducer at scale and broader RGW/Ceph smoke
   tests.

Alternative if compiler TLS is rejected or regresses:

1. Prototype vendored/rebuilt Boost.Process with
   `BOOST_ASIO_DISABLE_THREAD_KEYWORD_EXTENSION`.
2. Validate with tracer v3 that it initializes the matching pthread-TSS
   key and no longer leaves `radosgw` guard/key split.
3. Audit the process for any other loaded Asio-using distro DSOs with
   similar guard-only symbol behavior.

## Validation checklist for any final fix

A final package-quality fix should pass:

- Keystone-backed S3 CreateBucket/ListBuckets/DeleteBucket loop, at least 25/25;
- repeated wrapper run with clean exit;
- service remains `ActiveState=active`, `NRestarts=0`;
- no `SIGSEGV`, `core-dump`, `destroy_post()+0x586`, or
  `pthread_getspecific(0)` Asio crash in journal;
- compile command audit for intended Asio macros;
- symbol audit showing no mixed guard/top state across `radosgw`,
  `libceph-common`, `librados`, `librbd`, `libboost_process`;
- librados/librbd Asio regression coverage for the original
  `29ee772...` scenario.

## Bottom line

The crash is caused by a mixed Boost.Asio TLS model in one process.
Ceph's historical pthread-TSS workaround collides with distro
Boost.Process's compiler-TLS Asio symbols, allowing Boost.Process to set
`radosgw`'s Asio guard without creating `radosgw`'s pthread key.

The most robust long-term fix is to make Ceph and distro Boost libraries
use the same Asio TLS model, preferably compiler TLS, while explicitly
validating the old librados/librbd multi-DSO concern. The safest
already-validated containment is RGW/radosgw-scoped
`BOOST_ASIO_DISABLE_SMALL_BLOCK_RECYCLING`, but that should remain a
fallback because it does not repair the corrupted Asio state.

https://bugs.launchpad.net/bugs/2154304

Title:
  ceph radosgw segv with keystone auth
