vchag opened a new issue, #63193: URL: https://github.com/apache/doris/issues/63193
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

- Doris version: 4.0.4
- brpc version: 1.4.0
- Number of BEs: 20
- Ingest rate: 400–500K EPS (cluster-wide)
- Ingest mode: group_commit sync_mode
- Table type: DUPLICATE KEY

### What's Wrong?

BE nodes crash with a segmentation fault (SIGSEGV) under sustained high-throughput ingestion. The crash occurs inside `bvar::detail::SamplerCollector::run()` and is caused by a race condition in brpc 1.4.0's `AgentCombiner`: when a thread exits while `SamplerCollector` is iterating the agent list, the collector dereferences already-freed memory.

At high EPS, the 28 global `bvar::Adder<int64_t>` instances in `metadata_adder.h` are updated tens of thousands of times per second across many worker threads, which makes the race reliably reproducible. Any single BE exceeding ~15–20K EPS is at risk, and multiple BEs typically crash within 30 minutes.

```
Segmentation fault (core dumped)
0# doris::signal::(anonymous namespace)::FailureSignalHandler at be/src/common/signal_handler.h:420
1# PosixSignals::chained_handler in /usr/lib/jvm/java/lib/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java/lib/server/libjvm.so
3# 0x00007F9881299520 in /lib/x86_64-linux-gnu/libc.so.6
4# bvar::Reducer<long, bvar::detail::AddTo<long>, bvar::detail::MinusFrom<long>>::SeriesSampler::take_sample() at thirdparty/installed/include/bvar/reducer.h:79
5# bvar::detail::SamplerCollector::run() in /opt/apache-doris/be/lib/doris_be
```

### What You Expected?

BE nodes should remain stable under sustained high-throughput ingestion. No crashes or segmentation faults should occur at any EPS, as long as the hardware and configuration are within supported limits.

### How to Reproduce?

1. Deploy Doris 4.0.4 with 20 BEs using `group_commit sync_mode` on a `DUPLICATE KEY` table.
2. Drive sustained ingestion at 400–500K EPS across the cluster, or ~15–20K EPS on a single BE.
3. Observe a BE SIGSEGV within approximately 30 minutes. The crash is in `bvar::detail::SamplerCollector::run()`, as shown in the stack trace above.

### Anything Else?

This is a known upstream bug in brpc, already fixed via [PR #2949](https://github.com/apache/brpc/pull/2949) (merged April 17, 2025), which converts `_combiner` to a `shared_ptr` to eliminate the use-after-free.

> Doris 4.0.4 pins brpc at 1.4.0 in `thirdparty/vars.sh:208-209`, which predates this fix.

We propose bumping the brpc pin to a version that includes PR #2949, and are willing to contribute the PR. Feedback on the preferred target brpc version is welcome.

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
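For readers unfamiliar with this failure class, here is a minimal C++ sketch of the general idea behind the fix described above: the sampler shares ownership of the combiner instead of holding a raw pointer, so it can never read freed memory. This is not brpc's code. `ToyCombiner`, `ToySampler`, `ToyAdder`, and `sample_after_owner_destroyed` are hypothetical names invented for illustration, and the sketch demonstrates only the lifetime guarantee, not the thread-exit race itself.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for a combiner that aggregates values from writers.
class ToyCombiner {
public:
    void add(int64_t v) {
        std::lock_guard<std::mutex> lock(mu_);
        total_ += v;
    }
    int64_t combined() const {
        std::lock_guard<std::mutex> lock(mu_);
        return total_;
    }

private:
    mutable std::mutex mu_;
    int64_t total_ = 0;
};

// Hypothetical sampler. It co-owns the combiner through a shared_ptr, so the
// combiner stays alive for as long as the sampler may still read from it.
class ToySampler {
public:
    explicit ToySampler(std::shared_ptr<ToyCombiner> c) : combiner_(std::move(c)) {}
    int64_t take_sample() const { return combiner_->combined(); }

private:
    std::shared_ptr<ToyCombiner> combiner_;
};

// Hypothetical counter variable that hands out samplers sharing its combiner.
class ToyAdder {
public:
    ToyAdder() : combiner_(std::make_shared<ToyCombiner>()) {}
    ToyAdder& operator<<(int64_t v) {
        combiner_->add(v);
        return *this;
    }
    ToySampler make_sampler() const { return ToySampler(combiner_); }

private:
    std::shared_ptr<ToyCombiner> combiner_;
};

// The adder is destroyed before the sample is taken. With a raw pointer in
// ToySampler this would be a use-after-free; with shared ownership it is safe.
int64_t sample_after_owner_destroyed() {
    std::unique_ptr<ToySampler> sampler;
    {
        ToyAdder adder;
        adder << 40 << 2;
        sampler = std::make_unique<ToySampler>(adder.make_sampler());
    }  // adder goes out of scope here; the combiner survives via the sampler
    return sampler->take_sample();
}
```

Calling `sample_after_owner_destroyed()` reads the combined value after the owning `ToyAdder` has gone away. The equivalent raw-pointer design would have undefined behavior at that point, which is the kind of access the stack trace above shows in `take_sample()`.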
