ojalberts-itc opened a new issue, #64708: URL: https://github.com/apache/doris/issues/64708
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

We are on Apache Doris 4.0.6 GA (x86_64, AVX2). The running build string reads `doris-4.0.6-rc02`, which is misleading -- it is the GA, not a hand-picked release candidate. The official `apache-doris-4.0.6-bin-x64.tar.gz` was cut from the `4.0.6-rc02` tag and the embedded build string was never bumped to drop the `-rc02` suffix.

We verified the artifact two independent ways so this is not dismissed as an RC:

1. Build-commit match. Our binary reports commit `1663f25c16f`; the Apache Doris 4.0.6 release is commit `1663f25`. A binary can only embed that commit if it was built from that exact tree.
2. Cross-mirror sha512 match. `apache-doris-4.0.6-bin-x64.tar.gz.sha512` is byte-identical on the official release host and the mirror our deploy pulls from: `8f869c4399088d3dc34e5ade10047495e42c7c0583fb32156adaf0794a56e5942b8c0142c05fc145d58d4148daf0ee8d0dde73c9aab0224f39b2435f406c8ef8`.

MySQL-wire version reported: `5.7.99`. We have not yet tested 4.1.x.

### What's Wrong?

On a coupled-mode 4.0.6 cluster, the BE write path wedges. `INSERT` / `CREATE TABLE AS SELECT` / MV-refresh hang and then fail with `failed to write enough replicas N/M ... due to connection errors`, while every node still reports `Alive=true` and reads keep working. Once wedged, only a full restart of all BE processes recovers it -- a single-BE restart does not.

We chased this for a full day with live instrumentation and found three distinct causes, not one. Two were ours and are fixed. The third is the reason for this report: a BE-to-BE load-stream brpc socket on port 8060 goes "Broken" and is never revived, and we cannot fix it from the 4.0.6 config surface. We believe it is an upstream defect of the Apache brpc [#1168](https://www.mail-archive.com/[email protected]/msg03092.html) class.

| # | Cause | Trigger | Status |
| - | ----- | ------- | ------ |
| 1 | Our security group was missing a BE-to-BE self-ingress on port 8040 (`webserver_port`, clone snapshot download), so clone REPAIR could never complete and the FE ran an unbounded repair-clone storm | replica repair / single-BE restart | Fixed -- our IaC bug, not Doris |
| 2 | brpc load-stream-open stall on 8060 under heavy multi-replica load | heavy multi-replica `INSERT` | Mitigated with `experimental_enable_single_replica_insert=true` |
| 3 | A BE-to-BE load-stream brpc socket (8060) goes "Broken" and is never revived | accumulation of ~6--7 BE-to-BE stream opens | Open -- suspected upstream defect, this report |

Cause #1 was our mistake. We include it so it is clear the residual (cause #3) is independent of it: after we fixed the security group and clones completed cleanly, cause #3 still reproduces.

### The bug (cause #3)

On a healthy cluster, repl=3 writes succeed. After roughly 6--7 successful BE-to-BE load-stream-open operations, a specific brpc socket on the BE-to-BE load-stream path (8060) enters a state where the next load-stream-open RPC to the affected peer parks until the RPC timeout (~534 s) and then fails, taking down the whole write path. The socket is never revived. Reads and the Thrift heartbeat (9050, a separate threadpool) stay healthy the entire time, so `SHOW BACKENDS` shows `Alive=true` throughout.
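Because `Alive=true` says nothing about writability in this failure mode, a write probe with a hard deadline is one way to surface the wedge quickly. The sketch below is only an illustration of that idea, not the check we run: `write_canary` and `write_fn` are hypothetical names, and in practice `write_fn` would wrap a small bounded `INSERT` issued through whatever MySQL-protocol client is at hand.

```python
import threading
import time
from typing import Callable


def write_canary(write_fn: Callable[[], None], deadline_s: float) -> str:
    """Run one small write with a hard deadline.

    Returns "ok" if write_fn returned in time, "failed" if it raised,
    and "wedged" if it was still running when the deadline expired.
    """
    outcome = {}

    def target() -> None:
        try:
            write_fn()
            outcome["status"] = "ok"
        except Exception as exc:  # a fast failure is still a useful signal
            outcome["status"] = "failed"
            outcome["error"] = repr(exc)

    # Daemon thread: a parked write must not keep the health check alive.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(deadline_s)
    if worker.is_alive():
        return "wedged"
    return outcome.get("status", "failed")


if __name__ == "__main__":
    # Stand-in for a hung INSERT: sleeps past the deadline.
    print(write_canary(lambda: time.sleep(2), deadline_s=0.1))  # wedged
```

Separating "failed" from "wedged" matters here: a write that errors quickly is a different condition from one that parks on the open RPC, and only the second matches the behaviour described above.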
#### Signature (verbatim, BE `be.WARNING` and FE `fe.log`)

Coordinator-side open failure -- 60s is the `tablet_writer_open_rpc_timeout_sec` default:

```text
load_stream_stub.cpp:591 open stream failed: [INTERNAL_ERROR]Failed to connect to backend <id>: [E1008]Reached timeout=60000ms @10.0.0.105:8060
```

The long park -- the RPC itself stalls ~534s before failing:

```text
brpc_closure.h RPC meet failed: [E1008]Reached timeout=533999ms @10.0.0.105:8060
brpc_closure.h RPC meet failed: [E1008]Reached timeout=533999ms @10.0.0.118:8060
brpc_closure.h RPC meet failed: [E1008]Reached timeout=533999ms @10.0.0.155:8060
```

Loopback proof that this is in-process, not network or security group -- a BE times out opening a tablet-writer to its own 8060, and cancels a load-stream whose source and destination are the same BE:

```text
load_id=..., txn_id=6, node=10.0.0.118:8060, open failed, err: ... RPC call is timed out, error_text=[E1008]Reached timeout=60000ms @10.0.0.118:8060, host: 10.0.0.118
load_stream_stub ... src_id=...499, dst_id=...499, stream_id=1740 is cancelled ... write enough replicas 1/3
```

```text
brpc_client_cache.h:326 open brpc connection to 10.0.0.105:8060 failed: [E1008]Reached timeout=60000ms
```

User-facing FE error:

```text
failed to open DeltaWriter <id>: failed to write enough replicas 1/3 for tablet <id> due to connection errors
```

At the original bring-up wedge the same error read `... 0/1 ...`.

#### In-process capture at a live wedge (pstack + bvar)

We captured the in-process BE state during a live wedge (2026-06-22, `doris-4.0.6-rc02`, commit `1663f25c16f`), before the recovery restart. Full dumps are attached.
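The `[E1008]` signature lines quoted above can be tallied mechanically when triaging a `be.WARNING` tail. The following is a minimal sketch under stated assumptions: `summarize_timeouts` is a hypothetical helper, it only recognises the `[E1008]Reached timeout=<n>ms @<ip>:<port>` shape seen in the excerpts, and it treats a peer IP equal to the local BE's IP as a loopback timeout.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Matches the timeout fragment as it appears in the quoted log excerpts.
E1008 = re.compile(r"\[E1008\]Reached timeout=(\d+)ms @(\d+\.\d+\.\d+\.\d+):(\d+)")


def summarize_timeouts(lines: Iterable[str], self_ip: str) -> Dict[str, object]:
    """Count E1008 timeouts per peer and flag those aimed at the local BE."""
    per_peer: Counter = Counter()
    loopback = 0
    longest_ms = 0
    for line in lines:
        for match in E1008.finditer(line):
            timeout_ms = int(match.group(1))
            ip, port = match.group(2), match.group(3)
            per_peer[f"{ip}:{port}"] += 1
            longest_ms = max(longest_ms, timeout_ms)
            if ip == self_ip:
                # A BE timing out against its own address cannot be a
                # security-group or routing problem.
                loopback += 1
    return {"per_peer": dict(per_peer), "loopback": loopback, "longest_ms": longest_ms}
```

A non-zero `loopback` count is the interesting field: it corresponds to the "BE times out against its own 8060" evidence rather than to an ordinary peer outage.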
A write thread (`gstack`, BE A) is parked in the brpc load-stream OPEN -- the V2 path:

```text
bthread_id_join
-> brpc::Channel::CallMethod
-> doris::FailureDetectChannel::CallMethod be/src/util/brpc_client_cache.h:121
-> doris::LoadStreamStub::open be/src/vec/sink/load_stream_stub.cpp:195 (txn_id=4442, total_streams=2, idle_timeout_ms=30000)
-> doris::LoadStreamStubs::open be/src/vec/sink/load_stream_stub.cpp:574
-> doris::vectorized::VTabletWriterV2::_open_streams_to_backend be/src/vec/sink/writer/vtablet_writer_v2.cpp:317
-> doris::vectorized::VTabletWriterV2::_open_streams vtablet_writer_v2.cpp:296
-> doris::vectorized::VTabletWriterV2::open vtablet_writer_v2.cpp:272
-> doris::vectorized::AsyncResultWriter::process_block be/src/vec/sink/writer/async_result_writer.cpp:119
-> doris::vectorized::AsyncResultWriter::start_writer async_result_writer.cpp:105
-> doris::ThreadPool::dispatch_thread
```

On another BE the same root appears via the V1 writer path -- `VNodeChannel::open_wait` (`vtablet_writer.cpp:704`) -> `bthread_id_join`. Both are parked on the brpc load-stream OPEN RPC to a peer backend.

It is **not worker-pool exhaustion, not compaction, not a stub leak** -- brpc `/vars` (8060) and `/metrics` (8040) at the wedge:

| BE | `bthread_worker_usage` / count | `load_channel_count` | `tablet_writer_count` | `brpc_stream_endpoint_stub_count` | compaction (base+cumulative) |
| -- | ------------------------------ | -------------------- | --------------------- | --------------------------------- | ---------------------------- |
| A | 0.20 / 256 | 2 | 8 | 4 | 0 |
| B | 54.6 / 256 | 3 | 9 | 4 | 0 |

Workers are nowhere near the 256 ceiling -- the write threads are parked on the RPC, not starved. Load channels and tablet writers are open and stuck; the stub count is the steady-state 4 (no leak); compaction is fully idle. `rpcz` was empty (off by default; `:8060/rpcz/enable` did not enable it at runtime on this build), so the parked-RPC evidence is the `gstack` above.
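The worker-headroom comparison above can be reproduced from a saved `/vars` dump. This is a sketch, assuming the dump is the plain-text `name : value` listing and that the worker ceiling is exposed under the name `bthread_worker_count`; both the helper names and that key are assumptions for illustration, so adjust them to whatever the dump actually contains.

```python
from typing import Dict


def parse_vars(text: str) -> Dict[str, float]:
    """Parse 'name : value' lines, keeping only numeric values."""
    values: Dict[str, float] = {}
    for line in text.splitlines():
        name, sep, raw = line.partition(" : ")
        if not sep:
            continue  # not a 'name : value' line
        try:
            values[name.strip()] = float(raw.strip())
        except ValueError:
            continue  # non-numeric bvar (strings, composite values)
    return values


def worker_saturation(
    values: Dict[str, float],
    usage_key: str = "bthread_worker_usage",
    count_key: str = "bthread_worker_count",  # assumed name of the ceiling bvar
) -> float:
    """Fraction of bthread workers in use; near 1.0 would suggest starvation."""
    return values[usage_key] / values[count_key]
```

With the BE B figures from the table (54.6 of 256), the fraction is about 0.21, which is the basis for reading the parked threads as blocked on the RPC rather than starved of workers.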
`load_stream_stub` cancellations appear across all BEs for the same load.

#### Trigger: accumulation, not a timer or idle decay

- Across two instrumented runs the wedge fired after 7 OK then wedge, and 6 OK then wedge, repl=3 write operations. It tracks the number of BE-to-BE load-stream opens, not a wall-clock interval.
- It fires both during a heavy multi-replica load and ~7--11 minutes after a load while the cluster is otherwise idle (no further writes issued).
- A restart followed by 60 minutes of pure idle with no load did not wedge. So it is load-induced, not idle decay.
- `brpc_stream_endpoint_stub_count` stayed at 4 across the wedge -- no stub-count leak. It is a specific socket going Broken, not stub exhaustion.

#### Recovery

A full restart of all BE processes clears it. A single-BE restart does not -- the rejoined BE's peers still hold the broken stub, so it rejoins a wedged mesh.

### What You Expected?

When a BE-to-BE load-stream brpc connection breaks, brpc should revive it (or the load-stream-open RPC should fail fast and the channel reconnect), so the write path recovers on its own. Instead the open RPC parks ~534s and the entire write path wedges while every node still reports `Alive=true`, and only a full BE-fleet restart clears it. A single broken socket should not require dropping all BE processes to recover.

### How to Reproduce?

1. Coupled-mode 4.0.6 cluster, 3 FE + 4 BE, default replication 3, stock be.conf.
2. Create native UNIQUE-KEY merge-on-write tables, `replication_num=3`, `DISTRIBUTED BY HASH(...) BUCKETS 16`.
3. Run a sequence of multi-replica writes that each open BE-to-BE load streams -- repeated `INSERT ... SELECT` / `CREATE TABLE AS SELECT` of a few million rows. In our case, four such loads plus a handful of `UPDATE ... FROM` statements per cycle.
4. After ~6--7 such operations -- during the load, or within ~7--11 minutes after -- a write hangs and fails `write enough replicas N/3 ... connection errors`. `be.WARNING` shows `[E1008]Reached timeout ... @<be>:8060`, including a loopback `@<self>:8060`.
5. `SELECT 1` and `SHOW BACKENDS` (`Alive=true`) keep working. Only a full BE restart recovers.

We have not reduced this to a minimal standalone reproducer; it reproduces reliably under our normal multi-replica load. We will run a targeted reproducer if you suggest one.

### Anything Else?

### Search / prior art

We searched the issue tracker, the `load_stream` / move-memtable PR history, the 4.1.x changelogs, and community forums (English and Chinese) before filing, and found no exact match for the full signature. The closest structural match is Apache brpc #1168 -- after a downstream node fault the upstream socket enters a "Broken" state and the health check never revives it; recovery requires restarting the upstream. Adjacent load-stream lifecycle fixes already in 4.0.6: #34883, #39231 / #39762, #60148, #60285. Possibly related and unconfirmed for 4.0.x: #56120 ("close brpc stream after load stream is closed"). If a maintainer recognises this as known or already fixed, a pointer to the PR is the fastest resolution.

### Environment

- Mode: coupled (storage-compute together), FE + BE only. No FoundationDB / Meta Service / Recycler / S3 storage vault. Native tablet data lives on BE-local EBS.
- Topology: 3 FE (HA followers) + 4 BE. Each node 8 vCPU / 64 GiB RAM.
- BE storage: one dedicated 500 GB gp3 volume per BE, xfs (`noatime,nodiratime`), mounted `/var/lib/doris/storage`, gp3 baseline 3000 IOPS / 125 MiB/s.
- OS / JDK: Amazon Linux 2023, Amazon Corretto 17.
- Replication: Doris default 3, across 4 BEs.
- Workload: read Apache Iceberg through a Glue / S3 external catalog, then write the aggregated result into native Doris UNIQUE-KEY merge-on-write tables -- `CREATE TABLE AS SELECT`, `INSERT ... SELECT`, and a few `UPDATE ... FROM` statements. About 5M rows per table, 4 tables.
- be.conf is effectively stock. The only non-default overrides are `mem_limit = 80%`, `storage_root_path`, and `priority_networks = <self>/32`. No brpc / clone / timeout tuning was set initially. Full dump at the end.

### What we ruled out, with positive evidence

All of the environment-layer suspects below were tested **directly, while wedged** (raw probes on 2026-06-22).

| Hypothesis | Verdict | Evidence |
| ---------- | ------- | -------- |
| TCP / network / security group / routing on 8060 | Ruled out | Raw TCP (`/dev/tcp`) to `:8060` is **OPEN to the peer and over loopback to self** on all 4 BEs while the brpc RPC on the same port times out; the listener is healthy (`LISTEN 0 1024 0.0.0.0:8060`). A brpc call failing to its own loopback `:8060` while raw TCP to that port succeeds cannot be network/SG/routing. |
| Host firewall (iptables / nftables / firewalld) | Ruled out | All 4 BEs: `iptables` 0 non-policy rules (default-ACCEPT), `ip6tables` 0, `nft` ruleset empty, `firewalld` inactive/absent. No host firewall exists. |
| SELinux | Ruled out | `getenforce` = **Permissive** on all 4 (policy `targeted`, mode permissive) -- it logs but cannot block. |
| ENA bandwidth throttle | Ruled out | `bw_in/out_allowance_exceeded` are non-zero **cumulative** but **Δ=0 over a 50s sample during the idle wedge** (they moved only during the loads); `pps_allowance_exceeded`=0, `conntrack_allowance_exceeded`=0. No active throttle while wedged. |
| conntrack / ephemeral ports | Ruled out | `nf_conntrack` module not loaded; ~53--60 of ~28k ephemeral ports used, 3 TIME-WAIT. Neither is exhausted. |
| Kernel / OOM / packet drops | Ruled out | `dmesg` / `journalctl -k` show no drop/deny/reject/oom/conntrack/throttle lines for the window. |
| Deployment / OS-tuning misconfig | Ruled out | Our install sets all Doris-required kernel tuning (`vm.max_map_count=2000000` -- live-confirmed, swap off, `nofile` 655350, THP madvise) and **runs `start_be.sh`'s preflight**, which the official `apache/doris` container deployment *skips* (`SKIP_CHECK_ULIMIT=true`). The official FE/BE images add no brpc/network/timeout config we lack -- only `priority_networks`. So it is not a deployment misconfiguration. |
| Compaction / merge-on-write delete-bitmap publish | Ruled out | Captured live at the wedge: every compaction metric is 0 on all 4 BEs -- `doris_be_compaction_task_state_total{base,cumulative}=0`, `doris_be_disks_compaction_score=0`, `doris_be_compaction_used_permits=0`, `doris_be_compaction_waitting_permits=0`, `doris_be_load_channel_count=0`, `doris_be_tablet_writer_count=0`. |
| Resource exhaustion (CPU / memory / IO) | Ruled out | At the wedge the BEs are near-idle: load avg ~0.0--0.09, ~55--60 GB RAM free, `doris_be` at 2--3% CPU. EBS volumes idle (`VolumeReadOps=0`, under 1 write IOPS, `VolumeQueueLength` ~0). |
| BE soft memory limit / flush back-pressure | Ruled out | Workload-group `total_mem_used` 0--158 MB against an ~53 GB limit; zero memory-exceed or MemoryGc-cancel lines. Memory would climb if flush stalled. |
| Crash / auto-restart / kernel OOM | Ruled out | `NRestarts=0`, single MainPID for the whole window on every BE; `dmesg` and `journalctl -k` empty for the window. |
| Replication factor (repl=3 itself) | Ruled out | A fresh-cluster full 4-table build at repl=3 completed cleanly and sustained, 0 errors. Earlier "repl=3 triggers it" readings were confounded by clusters already degraded by prior single-BE-restart experiments. |
| BE thread-pool exhaustion | Not the cause | No BE thread pool is pegged at the wedge: EvHttpServer at pool size 128, pipeline schedulers at normal 8/16, no compaction or memtable pool active. |

The one mechanism consistent with all of this is a brpc load-stream socket going Broken and never being revived: raw TCP to `:8060` connects (peer and loopback) while every brpc RPC on it times out; Doris's own health check evicts the stub (`remove brpc stub from cache`) and recreates it, and the new stub still times out; the errors are connect/open *timeouts* (never "Connection refused" or "reset"); and it clears only when the process is dropped. That is the Apache brpc #1168 class.

### Config we tried that did not fix it

| Setting | Where | Result |
| ------- | ----- | ------ |
| `enable_brpc_connection_check = true` | be.conf, immutable, rolling restart | No effect. This is the mechanism that should periodically check brpc connections and close/recreate broken ones (`brpc_connection_check_timeout_ms` = 10s default), but it did not revive the broken load-stream socket. Wedged again at +8 minutes. Kept as general hardening. |
| `experimental_enable_single_replica_insert = true` | FE global var | Partial and unreliable. Loads write one replica and clone the rest, so a single load completes instead of hanging, but the idle wedge still fires afterward and a later load still hung despite the setting. |

We did not raise `tablet_writer_open_rpc_timeout_sec` or `brpc_socket_max_unwritten_bytes` beyond defaults, because those mask the symptom -- a longer park -- rather than revive the socket. If you believe a specific brpc knob is the fix, we will test it.

### The two causes we fixed ourselves

We are listing these so it is clear the residual is isolated, and because one of them was our mistake and we would rather name it than route around it.

- Cause #1, our security-group bug -- fixed. Our BE security group self-referenced 8060 (brpc) and 9060 (be_port) but not 8040 (`webserver_port`, the HTTP port used for clone snapshot download between BEs). Loads over brpc 8060 worked, but clone REPAIR over `http://<be>:8040/api/_tablet/_download` timed out (`[HTTP_ERROR]Connection timed out after 15000 milliseconds`), so missing replicas never healed and the FE ran an unbounded VERY_HIGH repair-clone storm that saturated the BEs. Adding the 8040 BE-to-BE self-ingress rule fixed it: clones finish, drain to 0, replicas heal. This was our infrastructure error, not a Doris defect. We mention it only because, once fixed, cause #3 still reproduces -- which proves #3 is independent of it.
- Cause #2, load-stream-open stall under heavy multi-replica load -- mitigated. Distinct from the 8040 clone path; this is on the 8060 write path. Mitigated, not cured, by `experimental_enable_single_replica_insert`.

### Detection and the workaround we run today

- Detection. The Thrift heartbeat (9050) runs on a separate threadpool from the brpc write path (8060), so `SHOW BACKENDS ... Alive=true` is not a writability signal -- it stayed green for ~2.5 hours while every write was dead. We added a write-readiness canary, a small bounded `INSERT` over the 8060 path, to our health check, and a wedge now surfaces in seconds instead of hours.
- Recovery. A full BE-fleet restart. A single-BE restart does not clear it.

### Code-level analysis (Doris 4.0.6, bundled brpc 1.4.0)

We traced the captured stacks/logs into the 4.0.6 source. The load-bearing finding, from the **target** BE's `be.INFO` during the wedge:

- `PInternalService::open_load_stream` logs `"open load stream, load_id=..."` (internal_service.cpp:416) as the first line of the handler. During the wedge there were **0** such handler-entry lines on the target BEs in the wedge window, versus **1700+** historically -- while the BE worker pools sat **idle** (pstack: threads parked in `blocking_get`, not saturated; a saturated pool would fail `try_offer` fast, not time out at 60s).
- So the inbound `open_load_stream` RPC **never reaches the Doris service handler**. Combined with raw TCP to `:8060` being OPEN, the stall is between TCP-accept and service-dispatch -- inside **brpc 1.4.0**, below Doris's load-stream code. Doris's handler is not the stall point; it is never entered.

We did **not** pin the exact brpc 1.4.0 line -- it is in the bundled submodule (`thirdparty/vars.sh`, `apache/brpc` tag `1.4.0`), and the runtime probe that would pin it (brpc `rpcz` / socket bvars) could not be enabled at runtime on this build.

One secondary, non-root nuance we found: `FailureDetectChannel` invalidates a cached channel only on `EHOSTDOWN`, not on a timeout (`brpc_client_cache.h:80,125`) -- but we captured 249 `EHOSTDOWN` (`Host is down`) and channel rebuilds happened anyway and did not recover the wedge, so that is at most a hardening suggestion, not the cause.

### What we have captured and what else we can provide

We have captured the in-process state at a live wedge. Attached:

- `gstack` thread dumps of `doris_be` on the two BEs with parked write threads (full ~1747-thread dumps),
- brpc `/vars` (94 KB) and `/metrics` from each, showing the worker / load-channel / compaction state,
- `be.WARNING` tails with the `[E1008]` open failures and the `FailureDetectChannel` probe failures.

We could not get `rpcz` -- it is off by default and `:8060/rpcz/enable` did not enable it at runtime on this build. If there is a flag or build option to turn rpcz on, tell us and we will capture it. We can also pull a full `gdb -p` `thread apply all bt`, more specific brpc `bvar`s, or FE-side state on request.

One caveat on timing. This is a dev POC and we are moving on with our implementation, so the cluster will not stay up indefinitely. The reproducer, the captures above, and any candidate-build testing are only available while the cluster is still running -- so the sooner we can act on this, the better.

### Questions for the maintainers

1. Our evidence says the `open_load_stream` RPC never reaches the server handler (0 handler-entry logs, idle pools) while raw TCP to `:8060` is open -- consistent with a brpc 1.4.0 socket/stream that is accepted at TCP but never dispatched, and never revived (the brpc #1168 class). Is this a known brpc 1.4.0 defect on the load-stream path, and is there a fixing PR or a brpc version that resolves it?
2. Is the load-stream `brpc_client_cache` expected to revive a Broken socket automatically? In our capture it never did, and `enable_brpc_connection_check=true` did not help. Is that the intended recovery path, and should it have recovered the socket?
3. Is there a supported config that makes a Broken load-stream socket fail fast and reconnect, rather than park ~534s on the open RPC?
4. Is there evidence that 4.1.x (4.1.2 specifically) contains a relevant brpc / load-stream fix? We will run the upgrade test -- restart one BE, drive the load sequence, watch the canary -- and report back.

### Appendix -- config

be.conf (stock defaults plus these managed overrides only):

```text
JAVA_HOME = /usr/lib/jvm/java-17-amazon-corretto.x86_64
storage_root_path = /var/lib/doris/storage
priority_networks = <node_private_ip>/32
mem_limit = 80%
be_port = 9060 # shipped default
webserver_port = 8040 # shipped default
heartbeat_service_port = 9050 # shipped default
brpc_port = 8060 # shipped default
# added later as hardening; did NOT fix the wedge:
enable_brpc_connection_check = true
```

fe.conf (stock defaults plus these overrides only):

```text
JAVA_HOME = /usr/lib/jvm/java-17-amazon-corretto.x86_64
meta_dir = /var/lib/doris/fe-meta
priority_networks = <node_private_ip>/32
http_port = 8030 # shipped default
rpc_port = 9020 # shipped default
query_port = 9030 # shipped default
edit_log_port = 9010 # shipped default
# FE global var set at runtime; mitigates cause #2, not cause #3:
experimental_enable_single_replica_insert = true
```

Table shape (representative):

```sql
CREATE TABLE evo_persons (
    identity_hash varchar(32) NOT NULL,
    id_numbers_hash varchar(32) NOT NULL,
    ... -- aggregated attribute and counter columns
)
UNIQUE KEY(identity_hash, id_numbers_hash)
DISTRIBUTED BY HASH(identity_hash) BUCKETS 16
PROPERTIES ('replication_num'='3', 'enable_unique_key_merge_on_write'='true');
```

Ports:

| Port | Service | At the wedge |
| ---- | ------- | ------------ |
| 8060 | brpc (tablet-writer / load-stream OPEN) | timed out, all directions including loopback |
| 8040 | webserver (clone snapshot download) | timed out until our security-group fix (cause #1); fine after |
| 9050 | Thrift heartbeat (separate threadpool) | stayed responsive, so `SHOW BACKENDS` showed Alive=true |

[wedge.10.0.0.105.tar.gz](https://github.com/user-attachments/files/29211906/wedge.10.0.0.105.tar.gz)
[wedge.10.0.0.118.tar.gz](https://github.com/user-attachments/files/29211904/wedge.10.0.0.118.tar.gz)
[wedge.10.0.0.155.tar.gz](https://github.com/user-attachments/files/29211903/wedge.10.0.0.155.tar.gz)
[wedge.10.0.0.229.tar.gz](https://github.com/user-attachments/files/29211905/wedge.10.0.0.229.tar.gz)

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
