vchag opened a new issue, #63193:
URL: https://github.com/apache/doris/issues/63193

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Version
   
   Doris version: 4.0.4 
   brpc version: 1.4.0
   Number of BEs: 20 
   Ingest rate: 400-500K EPS (cluster-wide)
   Ingest Mode: group_commit sync_mode
   Table type: DUPLICATE KEY 
   
   ### What's Wrong?
   
   BE nodes crash with a segmentation fault (SIGSEGV) under sustained 
high-throughput ingestion. The crash occurs inside 
`bvar::detail::SamplerCollector::run()` and is caused by a race condition in 
brpc 1.4.0's `AgentCombiner`: when a worker thread exits while 
`SamplerCollector` is iterating the agent list, the collector dereferences 
already-freed memory.
   
   At high EPS, the 28 global bvar::Adder<int64_t> instances in 
metadata_adder.h are updated tens of thousands of times per second across many 
worker threads, making this race reliably reproducible. Any single BE exceeding 
~15–20K EPS is at risk, and multiple BEs typically crash within 30 minutes.
   
   ```
   Segmentation fault (core dumped)
   0# doris::signal::(anonymous namespace)::FailureSignalHandler
        at be/src/common/signal_handler.h:420
   1# PosixSignals::chained_handler
        in /usr/lib/jvm/java/lib/server/libjvm.so
   2# JVM_handle_linux_signal
        in /usr/lib/jvm/java/lib/server/libjvm.so
   3# 0x00007F9881299520
        in /lib/x86_64-linux-gnu/libc.so.6
   4# bvar::Reducer<long, bvar::detail::AddTo<long>,
      bvar::detail::MinusFrom<long>>::SeriesSampler::take_sample()
        at thirdparty/installed/include/bvar/reducer.h:79
   5# bvar::detail::SamplerCollector::run()
        in /opt/apache-doris/be/lib/doris_be
   ```
   
   
   
   ### What You Expected?
   
   BE nodes should remain stable under sustained high-throughput ingestion. No 
crashes or segmentation faults should occur regardless of EPS, as long as the 
hardware and configuration are within supported limits.
   
   ### How to Reproduce?
   
   1. Deploy Doris 4.0.4 with 20 BEs using `group_commit sync_mode` on a 
`DUPLICATE KEY` table.
   2. Drive sustained ingestion at 400–500K EPS across the cluster, or ~15–20K 
EPS on a single BE.
   3. Observe BE SIGSEGV within approximately 30 minutes. The crash is in 
`bvar::SamplerCollector::run()` as shown in the stack trace above.
   
   ### Anything Else?
   
   This is a known upstream bug in brpc, already fixed via [PR 
#2949](https://github.com/apache/brpc/pull/2949) (merged April 17, 2025), which 
converts `_combiner` to a `shared_ptr` to eliminate the use-after-free.
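   
   To illustrate the ownership idea behind that fix, here is a minimal, 
single-threaded sketch. It is not brpc's code and does not reproduce the race: 
`Combiner`, `Counter`, `Sampler`, and `sample_after_owner_destroyed` are 
hypothetical names chosen for this example. It only shows that a sampler 
holding a `shared_ptr` keeps the combiner alive after its owner is gone, where 
a raw pointer would dangle.
   
   ```cpp
   #include <cstdint>
   #include <memory>
   #include <utility>
   #include <vector>

   // Hypothetical stand-in for a combiner that aggregates per-thread values.
   struct Combiner {
       std::vector<int64_t> agents;  // one slot per contributing thread
       int64_t combine() const {
           int64_t sum = 0;
           for (int64_t v : agents) sum += v;
           return sum;
       }
   };

   // Hypothetical variable that owns its combiner through a shared_ptr.
   class Counter {
   public:
       Counter() : combiner_(std::make_shared<Combiner>()) {}
       void add(int64_t v) { combiner_->agents.push_back(v); }
       std::shared_ptr<Combiner> combiner() const { return combiner_; }
   private:
       std::shared_ptr<Combiner> combiner_;
   };

   // Hypothetical sampler that shares ownership instead of keeping a raw
   // pointer, so the combiner cannot be freed while a sample is pending.
   class Sampler {
   public:
       explicit Sampler(std::shared_ptr<Combiner> c) : combiner_(std::move(c)) {}
       int64_t take_sample() const { return combiner_->combine(); }
   private:
       std::shared_ptr<Combiner> combiner_;
   };

   // Destroys the owner first, then samples. With a raw Combiner* in Sampler
   // this would read freed memory; with shared ownership it stays valid.
   int64_t sample_after_owner_destroyed() {
       auto counter = std::make_unique<Counter>();
       counter->add(1);
       counter->add(2);
       counter->add(3);
       Sampler sampler(counter->combiner());
       counter.reset();  // owner goes away while the sampler is still live
       return sampler.take_sample();
   }
   ```
   
   In this model the combiner is released only when the last holder drops its 
reference, so the order in which the owner and the sampler finish no longer 
matters.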
   
   > Doris 4.0.4 pins brpc at 1.4.0 in `thirdparty/vars.sh:208-209`, which 
predates this fix. We propose bumping the brpc pin to a version that includes 
PR #2949, and are willing to contribute the PR. Feedback on the preferred 
target brpc version is welcome.
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

