This is an automated email from the ASF dual-hosted git repository.

yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git


The following commit(s) were added to refs/heads/master by this push:
     new bb628eb8ca1 [fix](arrow-flight) Harden split source error path to 
avoid BE crash on external table scan (#64797)
bb628eb8ca1 is described below

commit bb628eb8ca17cdd9c81dac38ad8ae4288a6f8af8
Author: Mingyu Chen (Rayner) <[email protected]>
AuthorDate: Fri Jun 26 16:50:52 2026 +0800

    [fix](arrow-flight) Harden split source error path to avoid BE crash on 
external table scan (#64797)
    
    ### What problem does this PR solve?
    
    Issue Number: ref #62259 (partial — robustness layer only, does not
    close the issue)
    
    Related PR: N/A
    
    Problem Summary:
    
    When an Arrow Flight SQL query scans an external table (e.g. Iceberg) in
    batch split mode, the BE fetches file splits from the FE via the
    `fetchSplitBatch` thrift RPC *during* the scan. If that fetch fails —
    most notably when the split source has already been released — the error
    path could crash the BE (SIGSEGV in
    `arrow::flight::internal::TransportStatus::FromStatus`) instead of
    failing the query gracefully (see #62259).
    
    This PR is the **robustness layer** for that issue: it ensures any
    `fetchSplitBatch` failure makes the query fail gracefully rather than
    crashing the BE. It does **not** fix the underlying split source
    lifecycle problem (the source being released after `GetFlightInfo` but
    before `DoGet` on the Arrow Flight two-phase path), which is tracked
    separately in the issue.
    
    Changes:
    
    1. **BE `split_source_connector`** — guard `result.status.error_msgs[0]`
    with an `empty()` check to avoid an out-of-bounds vector read when the
    FE returns a non-OK status without an error message.
    2. **BE `to_arrow_status`** — truncate the error message handed to
    Arrow/gRPC to a length well below 8192. The message is carried in the
    gRPC trailer (an HTTP2 header) and may be percent-encoded, so an
    oversized one can break the response or crash the flight transport
    status conversion. The 8192 limit was already documented in a comment in
    this function but was never enforced. The full message is still logged
    on the BE.
    3. **FE `fetchSplitBatch`** — when the split source has been released,
    return a structured `TStatus(NOT_FOUND)` with a message instead of
    throwing a bare `TException`. The BE then receives a well-formed,
    non-empty error through the normal result path instead of a thrift
    transport exception.
    
    ### Release note
    
    Fix a BE crash (SIGSEGV) that could happen on the error path of Arrow
    Flight SQL queries against external tables when fetching split batches
    from the FE fails.
    
    ### Check List (For Author)
    
    - Test
        - [ ] Regression test
        - [ ] Unit Test
        - [ ] Manual test (add detailed scripts or steps below)
        - [x] No need to test or manual test. Explain why:
    - [ ] This is a refactor/code format and no logic has been changed.
            - [ ] Previous test can cover this change.
            - [ ] No code files have been changed.
    - [x] Other reason: This is defensive hardening of an error path (an
    out-of-bounds guard, a message-length cap, and returning a structured
    error status instead of throwing). It only triggers when
    `fetchSplitBatch` fails; existing tests cover the success path, and the
    crash depends on Arrow/gRPC transport internals that are hard to
    reproduce deterministically in CI.
    
    - Behavior changed:
    - [x] Yes. On `fetchSplitBatch` failure the query now fails with a clear
    error message instead of (potentially) crashing the BE, and the FE no
    longer throws a bare `TException` for a released split source (it
    returns a `NOT_FOUND` status).
    
    - Does this need documentation?
        - [x] No.
    
    🤖 Generated with [Claude Code](https://claude.com/claude-code)
    
    ---------
    
    Co-authored-by: Claude Opus 4.8 (1M context) <[email protected]>
---
 be/src/exec/scan/split_source_connector.cpp                  |  4 +++-
 be/src/format/arrow/arrow_utils.cpp                          | 12 +++++++++++-
 .../java/org/apache/doris/service/FrontendServiceImpl.java   | 10 +++++++++-
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/be/src/exec/scan/split_source_connector.cpp 
b/be/src/exec/scan/split_source_connector.cpp
index 81748261d98..685a2d50f0c 100644
--- a/be/src/exec/scan/split_source_connector.cpp
+++ b/be/src/exec/scan/split_source_connector.cpp
@@ -59,7 +59,9 @@ Status RemoteSplitSourceConnector::get_next(bool* has_next, 
TFileRangeDesc* rang
             coord->fetchSplitBatch(result, request);
             if (result.__isset.status && result.status.status_code != 
TStatusCode::OK) {
                 return Status::IOError<false>("Failed to get batch of split 
source: {}",
-                                              result.status.error_msgs[0]);
+                                              result.status.error_msgs.empty()
+                                                      ? "unknown error"
+                                                      : 
result.status.error_msgs[0]);
             }
         } catch (std::exception& e) {
             return Status::IOError<false>("Failed to get batch of split 
source: {}", e.what());
diff --git a/be/src/format/arrow/arrow_utils.cpp 
b/be/src/format/arrow/arrow_utils.cpp
index 724af28be39..d9c29eceed0 100644
--- a/be/src/format/arrow/arrow_utils.cpp
+++ b/be/src/format/arrow/arrow_utils.cpp
@@ -38,7 +38,17 @@ arrow::Status to_arrow_status(const Status& status) {
         // The length of exception msg returned to the ADBC Client cannot 
larger than 8192,
         // otherwise ADBC Client will receive:
         // `INTERNAL: http2 exception Header size exceeded max allowed size 
(8192)`.
-        return arrow::Status::Invalid(status.to_string_no_stack());
+        // The message is carried in the gRPC trailer (an HTTP2 header) and 
may be
+        // percent-encoded (which can expand its size), so an oversized 
message can break the
+        // response or even crash the flight transport status conversion. 
Truncate it well below
+        // 8192 to leave headroom; the full message is already logged above.
+        constexpr size_t kMaxArrowStatusMsgLen = 1024;
+        std::string msg = status.to_string_no_stack();
+        if (msg.size() > kMaxArrowStatusMsgLen) {
+            msg.resize(kMaxArrowStatusMsgLen);
+            msg += "... (truncated)";
+        }
+        return arrow::Status::Invalid(msg);
     }
 }
 
diff --git 
a/fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java 
b/fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java
index fb74d7ca29a..cf4e86f8069 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java
@@ -1046,7 +1046,15 @@ public class FrontendServiceImpl implements 
FrontendService.Iface {
         SplitSource splitSource =
                 
Env.getCurrentEnv().getSplitSourceManager().getSplitSource(request.getSplitSourceId());
         if (splitSource == null) {
-            throw new TException("Split source " + request.getSplitSourceId() 
+ " is released");
+            // Return a structured error status instead of throwing a bare 
TException, so the BE
+            // receives a well-formed error (with a non-empty message) through 
the normal result
+            // path. Throwing here surfaces as a thrift transport exception on 
the BE side and,
+            // on the Arrow Flight path, may feed an empty/invalid error 
string into the gRPC
+            // status conversion. See 
https://github.com/apache/doris/issues/62259
+            LOG.warn("split source {} is released", 
request.getSplitSourceId());
+            result.status = new TStatus(TStatusCode.NOT_FOUND);
+            result.status.addToErrorMsgs("Split source " + 
request.getSplitSourceId() + " is released");
+            return result;
         }
         try {
             List<TScanRangeLocations> locations = 
splitSource.getNextBatch(request.getMaxNumSplits());


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to