This is an automated email from the ASF dual-hosted git repository.
yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
new bb628eb8ca1 [fix](arrow-flight) Harden split source error path to
avoid BE crash on external table scan (#64797)
bb628eb8ca1 is described below
commit bb628eb8ca17cdd9c81dac38ad8ae4288a6f8af8
Author: Mingyu Chen (Rayner) <[email protected]>
AuthorDate: Fri Jun 26 16:50:52 2026 +0800
[fix](arrow-flight) Harden split source error path to avoid BE crash on
external table scan (#64797)
### What problem does this PR solve?
Issue Number: ref #62259 (partial — robustness layer only, does not
close the issue)
Related PR: N/A
Problem Summary:
When an Arrow Flight SQL query scans an external table (e.g. Iceberg) in
batch split mode, the BE fetches file splits from the FE via the
`fetchSplitBatch` thrift RPC *during* the scan. If that fetch fails —
most notably when the split source has already been released — the error
path could crash the BE (SIGSEGV in
`arrow::flight::internal::TransportStatus::FromStatus`) instead of
failing the query gracefully (see #62259).
This PR is the **robustness layer** for that issue: it ensures any
`fetchSplitBatch` failure makes the query fail gracefully rather than
crashing the BE. It does **not** fix the underlying split source
lifecycle problem (the source being released after `GetFlightInfo` but
before `DoGet` on the Arrow Flight two-phase path), which is tracked
separately in the issue.
Changes:
1. **BE `split_source_connector`** — guard `result.status.error_msgs[0]`
with an `empty()` check to avoid an out-of-bounds vector read when the
FE returns a non-OK status without an error message.
2. **BE `to_arrow_status`** — truncate the error message handed to
Arrow/gRPC to a length well below 8192. The message is carried in the
gRPC trailer (an HTTP2 header) and may be percent-encoded, so an
oversized one can break the response or crash the flight transport
status conversion. The 8192 limit was already documented in a comment in
this function but was never enforced. The full message is still logged
on the BE.
3. **FE `fetchSplitBatch`** — when the split source has been released,
return a structured `TStatus(NOT_FOUND)` with a message instead of
throwing a bare `TException`. The BE then receives a well-formed,
non-empty error through the normal result path instead of a thrift
transport exception.
### Release note
Fix a BE crash (SIGSEGV) that could happen on the error path of Arrow
Flight SQL queries against external tables when fetching split batches
from the FE fails.
### Check List (For Author)
- Test
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [x] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [x] Other reason: This is defensive hardening of an error path (an
out-of-bounds guard, a message-length cap, and returning a structured
error status instead of throwing). It only triggers when
`fetchSplitBatch` fails; existing tests cover the success path, and the
crash depends on Arrow/gRPC transport internals that are hard to
reproduce deterministically in CI.
- Behavior changed:
- [x] Yes. On `fetchSplitBatch` failure the query now fails with a clear
error message instead of (potentially) crashing the BE, and the FE no
longer throws a bare `TException` for a released split source (it
returns a `NOT_FOUND` status).
- Does this need documentation?
- [x] No.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
---
be/src/exec/scan/split_source_connector.cpp | 4 +++-
be/src/format/arrow/arrow_utils.cpp | 12 +++++++++++-
.../java/org/apache/doris/service/FrontendServiceImpl.java | 10 +++++++++-
3 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/be/src/exec/scan/split_source_connector.cpp
b/be/src/exec/scan/split_source_connector.cpp
index 81748261d98..685a2d50f0c 100644
--- a/be/src/exec/scan/split_source_connector.cpp
+++ b/be/src/exec/scan/split_source_connector.cpp
@@ -59,7 +59,9 @@ Status RemoteSplitSourceConnector::get_next(bool* has_next,
TFileRangeDesc* rang
coord->fetchSplitBatch(result, request);
if (result.__isset.status && result.status.status_code !=
TStatusCode::OK) {
return Status::IOError<false>("Failed to get batch of split
source: {}",
- result.status.error_msgs[0]);
+ result.status.error_msgs.empty()
+ ? "unknown error"
+ :
result.status.error_msgs[0]);
}
} catch (std::exception& e) {
return Status::IOError<false>("Failed to get batch of split
source: {}", e.what());
diff --git a/be/src/format/arrow/arrow_utils.cpp
b/be/src/format/arrow/arrow_utils.cpp
index 724af28be39..d9c29eceed0 100644
--- a/be/src/format/arrow/arrow_utils.cpp
+++ b/be/src/format/arrow/arrow_utils.cpp
@@ -38,7 +38,17 @@ arrow::Status to_arrow_status(const Status& status) {
// The length of exception msg returned to the ADBC Client cannot
larger than 8192,
// otherwise ADBC Client will receive:
// `INTERNAL: http2 exception Header size exceeded max allowed size
(8192)`.
- return arrow::Status::Invalid(status.to_string_no_stack());
+ // The message is carried in the gRPC trailer (an HTTP2 header) and
may be
+ // percent-encoded (which can expand its size), so an oversized
message can break the
+ // response or even crash the flight transport status conversion.
Truncate it well below
+ // 8192 to leave headroom; the full message is already logged above.
+ constexpr size_t kMaxArrowStatusMsgLen = 1024;
+ std::string msg = status.to_string_no_stack();
+ if (msg.size() > kMaxArrowStatusMsgLen) {
+ msg.resize(kMaxArrowStatusMsgLen);
+ msg += "... (truncated)";
+ }
+ return arrow::Status::Invalid(msg);
}
}
diff --git
a/fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java
b/fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java
index fb74d7ca29a..cf4e86f8069 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java
@@ -1046,7 +1046,15 @@ public class FrontendServiceImpl implements
FrontendService.Iface {
SplitSource splitSource =
Env.getCurrentEnv().getSplitSourceManager().getSplitSource(request.getSplitSourceId());
if (splitSource == null) {
- throw new TException("Split source " + request.getSplitSourceId()
+ " is released");
+ // Return a structured error status instead of throwing a bare
TException, so the BE
+ // receives a well-formed error (with a non-empty message) through
the normal result
+ // path. Throwing here surfaces as a thrift transport exception on
the BE side and,
+ // on the Arrow Flight path, may feed an empty/invalid error
string into the gRPC
+ // status conversion. See
https://github.com/apache/doris/issues/62259
+ LOG.warn("split source {} is released",
request.getSplitSourceId());
+ result.status = new TStatus(TStatusCode.NOT_FOUND);
+ result.status.addToErrorMsgs("Split source " +
request.getSplitSourceId() + " is released");
+ return result;
}
try {
List<TScanRangeLocations> locations =
splitSource.getNextBatch(request.getMaxNumSplits());
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]