This is an automated email from the ASF dual-hosted git repository.
chenBright pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/brpc.git
The following commit(s) were added to refs/heads/master by this push:
new d8ddff8f Fix heap-use-after-free in ~AgentCombiner (#3291)
d8ddff8f is described below
commit d8ddff8f7b7b4f250fc1b2ea0816043747134ef9
Author: Bright Chen <[email protected]>
AuthorDate: Sun May 10 22:34:47 2026 +0800
Fix heap-use-after-free in ~AgentCombiner (#3291)
After PR #2949 changed Agent::combiner from a raw pointer to a
weak_ptr, MultiDimensionTest.shared started to fail under ASAN with a
heap-use-after-free in clear_all_agents() called from ~AgentCombiner.
Root cause
----------
std::shared_ptr makes its weak_ptr expire the instant use_count drops
to 0, while ~AgentCombiner only starts running after that. The two
events therefore form a race window between two threads:
* thread A: last shared_ptr<AgentCombiner> released
use_count -> 0 -> weak_ptr expired
~AgentCombiner -> clear_all_agents() takes _lock and
starts walking _agents
* thread B: thread exits, _destroy_tls_blocks() deletes ThreadBlock,
each ~Agent calls combiner.lock() which now returns null
(weak_ptr already expired), so commit_and_erase() is
skipped and the Agent is freed while still linked into
A's _agents list
* thread A: dereferences a freed LinkNode in clear_all_agents() ->
heap-use-after-free
Fix
---
Don't traverse _agents from ~AgentCombiner. Letting _agents go out of
scope is safe because:
* butil::LinkedList and butil::LinkNode have trivial destructors and
never traverse on destruction, so tearing down _agents does not
dereference any agent node.
* Surviving Agents simply observe combiner.expired() == true in
~Agent and skip commit_and_erase, leaving prev_/next_ dangling but
never read.
* If the freed agent_id is later reused by a new combiner and the
same TLS slot is taken, get_or_create_tls_agent calls Agent::reset
and Append's it into the new combiner's _agents.
LinkNode::InsertBefore only writes prev_/next_ (it never reads
their stale values), so the dangling pointers are safely
overwritten.
* Agent::element is destroyed together with its ThreadBlock, so any
non-POD resource it holds is still released; if the agent slot is
reused, Agent::reset rewrites the element value before it is
observed again.
The body of clear_all_agents() is kept (commented out) for reference,
together with a NOTE explaining why it must not be re-enabled from
~AgentCombiner.
---
src/bvar/detail/combiner.h | 70 +++++++++++++++++++++++++++++++++++++---------
1 file changed, 57 insertions(+), 13 deletions(-)
diff --git a/src/bvar/detail/combiner.h b/src/bvar/detail/combiner.h
index cae1b8ea..3007f50d 100644
--- a/src/bvar/detail/combiner.h
+++ b/src/bvar/detail/combiner.h
@@ -233,7 +233,38 @@ friend class GlobalValue<self_type>;
~AgentCombiner() {
if (_id >= 0) {
- clear_all_agents();
+ // NOTE: We intentionally do NOT walk `_agents` here (e.g. via the
+ // previously existed `clear_all_agents()`).
+ //
+ // `Agent` instances live inside per-thread `ThreadBlock`s owned by
+ // `AgentGroup` and are destroyed when their owning thread exits
+ // (via `_destroy_tls_blocks`). At that point `~Agent` calls
+ // `combiner.lock()`; if the combiner has already started its
+ // destruction the `weak_ptr` is expired and the agent will skip
+ // `commit_and_erase`, leaving its `LinkNode` linked to this
+ // combiner's `_agents`. If we tried to traverse `_agents` here we
+ // could touch agent nodes whose `ThreadBlock` was just freed by
+ // a concurrent thread-exit, causing heap-use-after-free
+ // (see issue #2937 follow-up).
+ //
+ // It is safe to leave the list "dirty" because:
+ // * `butil::LinkedList` / `butil::LinkNode` have trivial
+ // destructors and never traverse on destruction, so tearing
+ // down `_agents` here does not dereference any agent node.
+ // * After this combiner is gone, every still-alive `Agent` will
+ // observe `combiner.expired() == true` in `~Agent` and skip
+ // `commit_and_erase`, so the dangling `prev_/next_` pointers
+ // in those agents are never read.
+ // * If the freed `_id` is later reused by a new combiner and the
+ // same TLS slot is taken, `get_or_create_tls_agent` will call
+ // `Agent::reset` and `Append` the agent into the new
+ // combiner's `_agents`. `LinkNode::InsertBefore` only writes
+ // `prev_/next_` (never reads their stale values), so the
+ // dangling pointers are safely overwritten.
+ // * `Agent::element` is destroyed together with the
`ThreadBlock`,
+ // so any non-POD resource it holds is still released; if the
+ // agent slot is reused, `Agent::reset` will overwrite the
+ // element value before it is observed again.
AgentGroup::destroy_agent(_id);
_id = -1;
}
@@ -319,18 +350,31 @@ friend class GlobalValue<self_type>;
return agent;
}
- void clear_all_agents() {
- butil::AutoLock guard(_lock);
- // Resting agents is must because the agent object may be reused.
- // Set element to be default-constructed so that if it's non-pod,
- // internal allocations should be released.
- for (butil::LinkNode<Agent>* node = _agents.head(); node !=
_agents.end();) {
- node->value()->reset(ElementTp(), NULL);
- butil::LinkNode<Agent>* const saved_next = node->next();
- node->RemoveFromList();
- node = saved_next;
- }
- }
+ // NOTE: `clear_all_agents()` is intentionally kept but no longer called
+ // from `~AgentCombiner` (see the long comment in `~AgentCombiner`).
+ //
+ // Calling it from the destructor is unsafe: by the time the destructor
+ // runs, agent weak_ptrs have already expired and `~Agent` will skip
+ // `commit_and_erase`; a concurrent thread-exit can therefore free the
+ // `ThreadBlock` (and the agents inside it) while we are still walking
+ // `_agents` here, which is a heap-use-after-free.
+ //
+ // The body is left around (commented out) for reference / future use --
+ // do NOT re-enable it from `~AgentCombiner`.
+ //
+ // void clear_all_agents() {
+ // butil::AutoLock guard(_lock);
+ // // Resetting agents is a must because the agent object may be
+ // // reused. Set element to be default-constructed so that if it's
+ // // non-pod, internal allocations should be released.
+ // for (butil::LinkNode<Agent>* node = _agents.head();
+ // node != _agents.end();) {
+ // node->value()->reset(ElementTp(), NULL);
+ // butil::LinkNode<Agent>* const saved_next = node->next();
+ // node->RemoveFromList();
+ // node = saved_next;
+ // }
+ // }
const BinaryOp& op() const { return _op; }
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]