bneradt opened a new issue, #13101:
URL: https://github.com/apache/trafficserver/issues/13101

   ## Summary
   
   `traffic_server` can permanently deadlock during startup, once 
`init_accept_HttpProxyServer()` calls into remap plugin loading. The deadlock 
involves two locks:
   
   - the dynamic loader lock (`_dl_load_lock` / `_rtld_global+2312`), held by 
the main thread inside `dlopen()` of a remap plugin DSO, and
   - `Diags::tag_activate_lock`, held by an already-running UDP net handler 
thread that is constructing a function-static `DbgCtl`.
   
   When this hits, the process stays alive but never binds the configured 
`proxy.config.http.server_ports`, so all incoming connections are refused. The 
deadlock reproduces reliably in an ASan build of 10.2.x and is almost certainly 
latent in non-ASan builds too, with a much narrower race window.
   
   ## Affected version
   
   - 10.2.0 built from the 10.2.x branch (built Apr 16 2026), 
`-DENABLE_ASAN=ON`, `-DENABLE_QUICHE=ON`, BoringSSL, gcc-10, on Ubuntu 20.04 / 
kernel 5.4.
   
   ## Symptoms
   
   - `systemctl status trafficserver` shows the service `active (running)`.
   - `ss -tlnp` shows no `traffic_server` listener on 80 or 443 (configured 
server_ports `80 80:ipv6 443:ssl 443:ipv6:ssl 443:quic 443:ipv6:quic`).
   - `traffic.out` is empty (process never finished early startup logging).
   - All worker threads (`ET_NET *`, `ET_AIO *`, `ET_UDP 0`, `ET_TASK *`, etc.) 
are spawned and idle/sleeping.
   
   ## Root cause: lock-order inversion
   
   Live `gdb` attach shows:
   
   **Thread 1 (`TS_MAIN`)** — owns `_dl_load_lock`, blocked on 
`Diags::tag_activate_lock`:
   
   ```
   __lll_lock_wait
   __GI___pthread_mutex_lock (mutex=Diags lock)
   Diags::lock                          DiagsTypes.h:268
   Diags::tag_activated                 src/tscore/Diags.cc:337
   Diags::debug_tag_activated
   DbgCtl::_new_reference               src/tsutil/DbgCtl.cc:154
   DbgCtl::DbgCtl ("conf_remap")
   __static_initialization_and_destruction_0   plugins/conf_remap/conf_remap.cc:35
   _GLOBAL__sub_I_conf_remap.cc
   call_init / _dl_init
   __interceptor_dlopen                 (ASan-instrumented dlopen)
   PluginDso::load                      src/proxy/http/remap/PluginDso.cc:127
   RemapPluginInfo::load
   PluginFactory::getRemapPlugin
   remap_load_plugin
   remap_parse_config_bti
   UrlRewrite::BuildTable / load
   init_reverse_proxy                   src/proxy/ReverseProxy.cc:78
   init_accept_HttpProxyServer          src/proxy/http/HttpProxyServerMain.cc:268
   main                                 src/traffic_server/traffic_server.cc:2358
   ```
   
   **Thread N (`ET_UDP 0`)** — owns `Diags::tag_activate_lock`, blocked on 
`_dl_load_lock`:
   
   ```
   __lll_lock_wait    (mutex = _rtld_global+2312, the dl loader lock)
   __cxa_thread_atexit_impl                     cxa_thread_atexit_impl.c:114
   __cxa_thread_atexit                          (libstdc++)
   RegexContext::get_instance                   src/tsutil/Regex.cc:66
           // thread_local RegexContext ctx;
   Regex::exec                                  src/tsutil/Regex.cc:439
   Diags::tag_activated                         src/tscore/Diags.cc:339
   DbgCtl::_new_reference                       src/tsutil/DbgCtl.cc:154
   DbgCtl::DbgCtl ("v_udpnet-service")
   PacketQueue::advanceNow                      src/iocore/net/P_UDPNet.h:237
           // static DbgCtl dbg_ctl{"v_udpnet-service"};
   UDPQueue::service                            src/iocore/net/UnixUDPNet.cc:1360
   UDPNetHandler::waitForActivity               src/iocore/net/UnixUDPNet.cc:1894
   EThread::execute_regular
   ```
   
   The lock orders:
   
   - Main: `_dl_load_lock` -> `Diags::tag_activate_lock`
     (`dlopen(conf_remap.so)` -> static-init `DbgCtl{"conf_remap"}` -> 
`Diags::tag_activated` -> lock).
   - ET_UDP: `Diags::tag_activate_lock` -> `_dl_load_lock`
     (`PacketQueue::advanceNow` constructs function-static 
`DbgCtl{"v_udpnet-service"}` -> `Diags::tag_activated` -> `Regex::exec` -> 
`RegexContext::get_instance()` lazily constructs `thread_local RegexContext 
ctx;` -> libstdc++ registers the destructor via `__cxa_thread_atexit_impl`, 
which takes the rtld global lock).
   
   Result: a classic A->B / B->A deadlock during normal startup. The UDP net 
handler thread is started before plugin DSOs are loaded; once it ticks, it can 
take `Diags::tag_activate_lock` while the main thread is mid-`dlopen`.
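
The inversion can be reproduced in isolation with two stand-in mutexes. This is a sketch, not ATS code: `loader_lock` and `diags_lock` merely play the roles of `_dl_load_lock` and `Diags::tag_activate_lock`, and `demonstrate_inversion()` is a hypothetical helper. Timed try-locks are used so that the demonstration reports the cycle instead of hanging.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

struct InversionResult {
  bool main_got_diags = true; // did the "main" role get its second lock?
  bool udp_got_loader = true; // did the "ET_UDP" role get its second lock?
};

InversionResult demonstrate_inversion()
{
  std::timed_mutex loader_lock;  // stands in for _dl_load_lock
  std::timed_mutex diags_lock;   // stands in for Diags::tag_activate_lock
  std::atomic<int> holding{0};   // threads currently holding their first lock
  std::atomic<int> attempted{0}; // threads done trying for their second lock
  InversionResult result;

  auto role = [&](std::timed_mutex &first, std::timed_mutex &second, bool &got_second) {
    std::lock_guard<std::timed_mutex> hold_first(first);
    holding.fetch_add(1);
    while (holding.load() < 2) {
      std::this_thread::yield(); // wait until both roles hold their first lock
    }
    // With plain lock() this call would block forever; the timeout lets the
    // sketch observe the failure and return.
    got_second = second.try_lock_for(std::chrono::milliseconds(50));
    if (got_second) {
      second.unlock();
    }
    attempted.fetch_add(1);
    while (attempted.load() < 2) {
      std::this_thread::yield(); // keep the first lock until both attempts end
    }
  };

  // Main role: loader lock first, then the Diags lock.
  std::thread main_role([&] { role(loader_lock, diags_lock, result.main_got_diags); });
  // ET_UDP role: Diags lock first, then the loader lock.
  std::thread udp_role([&] { role(diags_lock, loader_lock, result.udp_got_loader); });
  main_role.join();
  udp_role.join();
  return result;
}
```

Both flags come back `false`: each role holds the lock the other one needs, which is the state the two backtraces above show.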
   
   ## Why ASan makes it deterministic
   
   ASan's `__interceptor_dlopen` significantly slows symbol resolution and 
static initialization, widening the race window enough that the UDP thread 
reliably wins the race during plugin loading. Without ASan the timing is much 
tighter but the bug remains structurally present — any DSO load whose static 
initializers touch Diags (i.e. construct a `DbgCtl`) is at risk while UDP/QUIC 
threads are running.
   
   ## Suggested fixes (any one or combination)
   
   1. Avoid lazy `thread_local` registration on the Diags hot path. 
Pre-initialize per-thread `RegexContext` at thread spawn (or move it off 
`thread_local` storage) so `Regex::exec` doesn't transitively call 
`__cxa_thread_atexit_impl` (and therefore doesn't take the rtld lock) under 
`Diags::tag_activate_lock`.
   2. Don't take `Diags::tag_activate_lock` while invoking `Regex::exec`. 
Either snapshot the regex list under the lock and run `exec` outside it, or use 
a lock that doesn't nest with the rtld loader lock.
   3. Move function-static `DbgCtl` instances on hot/early threaded paths (e.g. 
`P_UDPNet.h:237` `v_udpnet-service`) to namespace-scope statics that get 
initialized before any worker threads are started.
   4. Defer remap plugin `dlopen` until after worker threads have been 
quiesced, or perform plugin DSO loads before starting UDP/net handler threads.
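
Fix 2 might look roughly like the following. This is a sketch under assumptions: `TagMatcher` is a hypothetical class, `std::regex` stands in for the ATS `Regex` type, and it assumes the lock only needs to protect the pointer to the compiled pattern. The lock is held just long enough to copy a `shared_ptr`, and matching runs on the snapshot with no lock held.

```cpp
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <utility>

// Hypothetical stand-in for the tag-matching part of Diags.
class TagMatcher
{
public:
  // Compile the new pattern outside the lock, then publish it under the lock.
  void set_pattern(const std::string &pattern)
  {
    auto fresh = std::make_shared<const std::regex>(pattern);
    std::lock_guard<std::mutex> guard(lock_);
    current_ = std::move(fresh);
  }

  // Copy the current pattern under the lock, then match without it, so
  // nothing the regex engine does can nest under lock_.
  bool tag_activated(const std::string &tag) const
  {
    std::shared_ptr<const std::regex> snapshot;
    {
      std::lock_guard<std::mutex> guard(lock_);
      snapshot = current_;
    }
    return snapshot != nullptr && std::regex_search(tag, *snapshot);
  }

private:
  mutable std::mutex lock_;
  std::shared_ptr<const std::regex> current_; // null until a pattern is set
};
```

For example, after `set_pattern("conf_remap|udpnet")`, `tag_activated("v_udpnet-service")` returns `true` and `tag_activated("http")` returns `false`. A concurrent `set_pattern` call does not invalidate a match in progress, because the snapshot keeps the old pattern alive.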
   
   ## Workaround
   
   Building without `-DENABLE_ASAN=ON` makes the deadlock rare (multi-hour 
uptime observed) but does not eliminate it.
   
   ## Related (separate) issue observed in the same deployment
   
   The previous instance crashed on `Fatal: 
src/proxy/http/HttpTransact.cc:8972: failed assertion '0'` from 
`HttpSM::update_stats()` -> `HttpSM::kill_this()` via 
`Http2Stream::main_event_handler` -> `update_size_and_time_stats(...)`. That is 
a different bug; will file separately if it isn't already tracked.
   


