uchenily opened a new pull request, #60363:
URL: https://github.com/apache/doris/pull/60363

   This PR ensures that `VMergeIteratorContext::copy_rows` iterates over all columns present in the input block by using `block->columns()` instead of an unsafe cached `_num_columns` value. This prevents column-count mismatches when the read schema changes: the data-copying logic stays synchronized with the actual structure of the block at runtime, regardless of whether the schema has been expanded for delete predicates.
   
   Consider the following table:
   
   ```sql
   CREATE TABLE tbl (
     k INT NOT NULL,
     v1 INT NOT NULL,
     v2 INT NOT NULL
   ) DUPLICATE KEY(k) ...;
   ```
   
   And a delete predicate applied to a non-key column:
   
   ```sql
   DELETE FROM tbl WHERE v1 = 1;
   ```
   
   When executing `ORDER BY k LIMIT n`, Doris applies a Top-N optimization. Even though the query is `SELECT *`, the engine initially avoids scanning all columns: it constructs a minimal intermediate schema containing only the sort key (`k`) and the internal `__DORIS_ROWID_COL__` to perform the merge and sort efficiently (`_col_ids = {0, 3}`, so `_num_columns = 2`). However, because a delete predicate exists on column `v1`, the `BetaRowsetReader` adds `v1` to this intermediate schema so that deleted rows can be evaluated and filtered out during the scan (`_col_ids = {0, 3, 1}`; note that column `v1` (index = 1) is appended, so `_num_columns = 3`).
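
   Restated as a tiny sketch (illustrative values only, mirroring the column ids above; this is not the actual Doris code):

   ```cpp
   #include <cstdint>
   #include <vector>

   int main() {
       // Minimal Top-N schema: sort key k (id 0) and __DORIS_ROWID_COL__ (id 3).
       std::vector<uint32_t> col_ids = {0, 3};  // a column count taken here is 2

       // BetaRowsetReader appends the delete-predicate column v1 (id 1):
       col_ids.push_back(1);                    // the schema now describes 3 columns
       // Any count cached from the narrower layout no longer matches this
       // expanded schema, and vice versa.
       return 0;
   }
   ```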
   
   The previous implementation of `VMergeIteratorContext::copy_rows` used this incorrect `_num_columns` value, resulting in an out-of-bounds column access that crashed (coredumped) the BE.
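
   The shape of the problem and of the fix, as a simplified sketch (illustrative types and code only, not the actual Doris implementation; `src` and `dst` are assumed to share the same layout):

   ```cpp
   #include <cstddef>
   #include <vector>

   // Simplified stand-ins for the vectorized Block/Column types.
   struct Column { std::vector<int> data; };
   struct Block {
       std::vector<Column> cols;
       size_t columns() const { return cols.size(); }
   };

   // Buggy shape: the loop bound is a count cached from the reader's schema,
   // which can exceed the number of columns the block actually holds
   // (e.g. 3 vs. 2), so cols[i] is accessed past the end of the vector.
   void copy_rows_old(const Block& src, Block& dst, size_t cached_num_columns) {
       for (size_t i = 0; i < cached_num_columns; ++i) {
           dst.cols[i].data.insert(dst.cols[i].data.end(),
                                   src.cols[i].data.begin(), src.cols[i].data.end());
       }
   }

   // Fixed shape: derive the loop bound from the block itself, so the copy
   // always matches the block's structure at runtime.
   void copy_rows_new(const Block& src, Block& dst) {
       for (size_t i = 0; i < src.columns(); ++i) {
           dst.cols[i].data.insert(dst.cols[i].data.end(),
                                   src.cols[i].data.begin(), src.cols[i].data.end());
       }
   }
   ```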
   
   Detailed reproduction steps are as follows:
   
   1. Modify `conf/be.conf` to use a very small write buffer, so that the loads flush multiple small segments and the merge read path is exercised:
   
   ```
   write_buffer_size = 8
   ```
   
   2. Run the following SQL:
   
   ```sql
   CREATE TABLE tbl1
   (
       k INT NOT NULL,
       v1 INT NOT NULL,
       v2 INT NOT NULL
   )
   DUPLICATE KEY(k)
   DISTRIBUTED BY HASH(k) BUCKETS 5
   PROPERTIES(
       "replication_num" = "1"
   );
   CREATE TABLE tbl2
   (
       k INT NOT NULL,
       v1 INT NOT NULL,
       v2 INT NOT NULL
   )
   DUPLICATE KEY(k)
   DISTRIBUTED BY HASH(k) BUCKETS 1
   PROPERTIES(
       "replication_num" = "1"
   );
   
   INSERT INTO tbl1 VALUES (1, 1, 1),(2, 2, 2),(3, 3, 3),(4, 4, 4),(5, 5, 5);
   INSERT INTO tbl2 SELECT * FROM tbl1;
   SELECT * FROM tbl2 ORDER BY k limit 100; -- ok
   
   DELETE FROM tbl2 WHERE v1 = 100;
   SELECT * FROM tbl2 ORDER BY k limit 100; -- coredump
   ```
   
   ### What problem does this PR solve?
   
   Issue Number: close #xxx
   
   Related PR: #xxx
   
   Problem Summary:
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [ ] Regression test
       - [ ] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason <!-- Add your reason?  -->
   
   - Behavior changed:
       - [ ] No.
       - [ ] Yes. <!-- Explain the behavior change -->
   
   - Does this need documentation?
       - [ ] No.
       - [ ] Yes. <!-- Add document PR link here. eg: 
https://github.com/apache/doris-website/pull/1214 -->
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label <!-- Add branch pick label that this PR should 
merge into -->
   
   

