bneradt opened a new issue, #13101:
URL: https://github.com/apache/trafficserver/issues/13101
## Summary
`traffic_server` can deadlock permanently during startup, once
`init_accept_HttpProxyServer()` calls into remap plugin loading. The two locks involved are:
- the dynamic loader lock (`_dl_load_lock` / `_rtld_global+2312`), held by
the main thread inside `dlopen()` of a remap plugin DSO, and
- `Diags::tag_activate_lock`, held by an already-running UDP net handler
thread that is constructing a function-static `DbgCtl`.
When this hits, the process stays alive but never binds the configured
`proxy.config.http.server_ports`, so all incoming traffic is refused. The
deadlock reproduces reliably in an ASan build of 10.2.x and is almost certainly
latent in non-ASan builds too, just with a much narrower race window.
## Affected version
- 10.2.0 built from the 10.2.x branch (built Apr 16 2026),
`-DENABLE_ASAN=ON`, `-DENABLE_QUICHE=ON`, BoringSSL, gcc-10, on Ubuntu 20.04 /
kernel 5.4.
## Symptoms
- `systemctl status trafficserver` shows the service `active (running)`.
- `ss -tlnp` shows no `traffic_server` listener on 80 or 443 (configured
server_ports `80 80:ipv6 443:ssl 443:ipv6:ssl 443:quic 443:ipv6:quic`).
- `traffic.out` is empty (the process never finishes early startup logging).
- All worker threads (`ET_NET *`, `ET_AIO *`, `ET_UDP 0`, `ET_TASK *`, etc.)
are spawned and idle/sleeping.
## Root cause: lock-order inversion
Live `gdb` attach shows:
**Thread 1 (`TS_MAIN`)** — owns `_dl_load_lock`, blocked on
`Diags::tag_activate_lock`:
```
__lll_lock_wait
__GI___pthread_mutex_lock (mutex=Diags lock)
Diags::lock DiagsTypes.h:268
Diags::tag_activated src/tscore/Diags.cc:337
Diags::debug_tag_activated
DbgCtl::_new_reference src/tsutil/DbgCtl.cc:154
DbgCtl::DbgCtl ("conf_remap")
__static_initialization_and_destruction_0
plugins/conf_remap/conf_remap.cc:35
_GLOBAL__sub_I_conf_remap.cc
call_init / _dl_init
__interceptor_dlopen (ASan-instrumented dlopen)
PluginDso::load src/proxy/http/remap/PluginDso.cc:127
RemapPluginInfo::load
PluginFactory::getRemapPlugin
remap_load_plugin
remap_parse_config_bti
UrlRewrite::BuildTable / load
init_reverse_proxy src/proxy/ReverseProxy.cc:78
init_accept_HttpProxyServer
src/proxy/http/HttpProxyServerMain.cc:268
main
src/traffic_server/traffic_server.cc:2358
```
**Thread N (`ET_UDP 0`)** — owns `Diags::tag_activate_lock`, blocked on
`_dl_load_lock`:
```
__lll_lock_wait (mutex = _rtld_global+2312, the dl loader lock)
__cxa_thread_atexit_impl cxa_thread_atexit_impl.c:114
__cxa_thread_atexit (libstdc++)
RegexContext::get_instance src/tsutil/Regex.cc:66
// thread_local RegexContext ctx;
Regex::exec src/tsutil/Regex.cc:439
Diags::tag_activated src/tscore/Diags.cc:339
DbgCtl::_new_reference src/tsutil/DbgCtl.cc:154
DbgCtl::DbgCtl ("v_udpnet-service")
PacketQueue::advanceNow src/iocore/net/P_UDPNet.h:237
// static DbgCtl dbg_ctl{"v_udpnet-service"};
UDPQueue::service
src/iocore/net/UnixUDPNet.cc:1360
UDPNetHandler::waitForActivity
src/iocore/net/UnixUDPNet.cc:1894
EThread::execute_regular
```
The lock orders:
- Main: `_dl_load_lock` -> `Diags::tag_activate_lock`
(`dlopen(conf_remap.so)` -> static-init `DbgCtl{"conf_remap"}` ->
`Diags::tag_activated` -> lock).
- ET_UDP: `Diags::tag_activate_lock` -> `_dl_load_lock`
(`PacketQueue::advanceNow` constructs function-static
`DbgCtl{"v_udpnet-service"}` -> `Diags::tag_activated` -> `Regex::exec` ->
`RegexContext::get_instance()` lazily constructs `thread_local RegexContext
ctx;` -> libstdc++ registers the destructor via `__cxa_thread_atexit_impl`,
which takes the rtld global lock).
Result: a classic A->B / B->A deadlock during normal startup. The UDP net
handler thread is started before plugin DSOs are loaded; once it ticks, it can
take `Diags::tag_activate_lock` while the main thread is mid-`dlopen`.
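
The inversion described above can be reproduced in isolation. The following is a minimal sketch, not ATS code: `loader_lock` and `diags_lock` are hypothetical stand-ins for the rtld loader lock and `Diags::tag_activate_lock`, and `demonstrate_inversion` is an illustrative name. It uses `try_lock` for the second acquisition so the program reports the conflict instead of hanging; with a blocking `lock()` both threads would wait forever, as in the traces.

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <thread>

// Records whether each thread managed to take its second lock.
struct InversionResult {
  bool main_got_second;
  bool udp_got_second;
};

// Hypothetical stand-ins: loader_lock plays the rtld loader lock,
// diags_lock plays Diags::tag_activate_lock.
InversionResult demonstrate_inversion()
{
  std::mutex loader_lock;
  std::mutex diags_lock;
  std::promise<void> main_ready, udp_ready, main_tried, udp_tried;
  std::future<void> main_ready_f = main_ready.get_future();
  std::future<void> udp_ready_f  = udp_ready.get_future();
  std::future<void> main_tried_f = main_tried.get_future();
  std::future<void> udp_tried_f  = udp_tried.get_future();
  InversionResult result{true, true};

  // Order A -> B, like the main thread: loader lock first, then Diags lock.
  std::thread main_like([&] {
    std::lock_guard<std::mutex> hold(loader_lock);
    main_ready.set_value();
    udp_ready_f.wait(); // make sure the other thread holds its first lock
    result.main_got_second = diags_lock.try_lock();
    if (result.main_got_second) {
      diags_lock.unlock();
    }
    main_tried.set_value();
    udp_tried_f.wait(); // keep holding until the other thread has tried too
  });

  // Order B -> A, like ET_UDP: Diags lock first, then loader lock.
  std::thread udp_like([&] {
    std::lock_guard<std::mutex> hold(diags_lock);
    udp_ready.set_value();
    main_ready_f.wait();
    result.udp_got_second = loader_lock.try_lock();
    if (result.udp_got_second) {
      loader_lock.unlock();
    }
    udp_tried.set_value();
    main_tried_f.wait();
  });

  main_like.join();
  udp_like.join();
  return result;
}
```

Both fields of the returned struct come back `false`: each thread holds the lock the other one needs, which is exactly the state the two `gdb` backtraces capture.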
## Why ASan makes it deterministic
ASan's `__interceptor_dlopen` significantly slows symbol resolution and
static initialization, widening the race window enough that the UDP thread
reliably wins the race during plugin loading. Without ASan the timing is much
tighter but the bug remains structurally present — any DSO load whose static
initializers touch Diags (i.e. construct a `DbgCtl`) is at risk while UDP/QUIC
threads are running.
## Suggested fixes (any one or combination)
1. Avoid lazy `thread_local` registration on the Diags hot path.
Pre-initialize per-thread `RegexContext` at thread spawn (or move it off
`thread_local` storage) so `Regex::exec` doesn't transitively call
`__cxa_thread_atexit_impl` (and therefore doesn't take the rtld lock) under
`Diags::tag_activate_lock`.
2. Don't take `Diags::tag_activate_lock` while invoking `Regex::exec`.
Either snapshot the regex list under the lock and run `exec` outside it, or use
a lock that doesn't nest with the rtld loader lock.
3. Move function-static `DbgCtl` instances on hot/early threaded paths (e.g.
`P_UDPNet.h:237` `v_udpnet-service`) to namespace-scope statics that get
initialized before any worker threads are started.
4. Defer remap plugin `dlopen` until after worker threads have been
quiesced, or perform plugin DSO loads before starting UDP/net handler threads.
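
As an illustration of option 2, here is a minimal sketch of the "snapshot under the lock, match outside it" pattern. It is not the ATS implementation: `TagMatcher`, `set_pattern`, and `activated` are hypothetical names, and `std::regex` stands in for the project's own `Regex` class. The point is only that the mutex protects the pointer swap, so nothing the matcher does internally (such as lazily constructing a `thread_local`) runs while the lock is held.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <utility>

// Hypothetical stand-in for the tag-activation check in Diags.
class TagMatcher
{
public:
  // Compile the new pattern outside the lock, then publish it under the lock.
  void set_pattern(const std::string &pattern)
  {
    auto fresh = std::make_shared<const std::regex>(pattern);
    std::lock_guard<std::mutex> guard(lock_);
    current_ = std::move(fresh);
  }

  // Copy the shared_ptr under the lock, release the lock, then match.
  bool activated(const std::string &tag) const
  {
    std::shared_ptr<const std::regex> snapshot;
    {
      std::lock_guard<std::mutex> guard(lock_);
      snapshot = current_;
    }
    // The lock is no longer held here, so the match cannot nest another
    // lock underneath it.
    return snapshot && std::regex_search(tag, *snapshot);
  }

private:
  mutable std::mutex lock_;
  std::shared_ptr<const std::regex> current_;
};
```

A caller would use it as `TagMatcher m; m.set_pattern("conf_remap|udp"); m.activated("v_udpnet-service");`. The trade-off is one `shared_ptr` copy per check, which is likely acceptable because `DbgCtl` construction happens once per tag rather than per log line.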
## Workaround
Building without `-DENABLE_ASAN=ON` makes the deadlock rare (multi-hour
uptime observed) but does not eliminate it.
## Related (separate) issue observed in the same deployment
The previous instance crashed on `Fatal:
src/proxy/http/HttpTransact.cc:8972: failed assertion '0'` from
`HttpSM::update_stats()` -> `HttpSM::kill_this()` via
`Http2Stream::main_event_handler` -> `update_size_and_time_stats(...)`. That is
a different bug; I will file it separately if it isn't already tracked.