[PR] [opt](csv reader) optimize nullable string deserialization in CSV/text load hot path [doris]

via GitHub Fri, 12 Jun 2026 09:13:41 -0700


liaoxin01 opened a new pull request, #64476:
URL: https://github.com/apache/doris/pull/64476


   ### What problem does this PR solve?
   
   Issue Number: close #xxx
   
   Related PR: #60920 (previous attempt, superseded by this stateless 
implementation)
   
   Problem Summary:
   
   When loading CSV data, every column is read as a nullable string, so 
`_deserialize_nullable_string` is the per-row, per-column hot path (ClickBench: 
105 columns x 100M rows = ~10.5 billion cells). A flame graph shows two major 
per-cell overheads:
   
   1. `assert_cast<ColumnNullable&>` performs a typeid comparison per cell in 
release builds.
   2. `DataTypeStringSerDe::deserialize_one_cell_from_csv` adds a call layer 
with another per-cell `assert_cast<ColumnString&>` inside, plus Status 
plumbing. Its fill-null-on-failure branch is dead code since the method never 
fails.
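
The per-cell cost of a checked downcast can be illustrated with a minimal sketch. The types `Column`, `StringColumn`, `NullableColumn` and the helpers `checked_cast` / `unchecked_cast` below are hypothetical stand-ins, not Doris's `assert_cast` or column classes; they only model the difference between comparing `typeid` on every call and a plain `static_cast` that is verified in debug builds only.

```cpp
#include <cassert>
#include <typeinfo>

// Hypothetical stand-ins for a polymorphic column hierarchy (not Doris code).
struct Column {
    virtual ~Column() = default;
};
struct StringColumn : Column {};
struct NullableColumn : Column {};

// Checked downcast: compares the dynamic type on every call and throws on
// mismatch. Called once per cell, this comparison runs billions of times.
template <typename To>
To& checked_cast(Column& c) {
    if (typeid(c) != typeid(To)) {
        throw std::bad_cast();
    }
    return static_cast<To&>(c);
}

// Unchecked downcast: a plain static_cast when NDEBUG is defined; the
// assert still verifies the type in debug builds.
template <typename To>
To& unchecked_cast(Column& c) {
    assert(typeid(c) == typeid(To));
    return static_cast<To&>(c);
}
```

In this sketch the unchecked variant is only safe if something else has already validated the type, which is the role a once-per-batch check can play.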
   
   ### Changes
   
   1. Use `assert_cast<..., TypeCheckOnRelease::DISABLE>` in 
`CsvReader::_deserialize_nullable_string` and 
`TextReader::_deserialize_nullable_string`, which compiles to a plain 
`static_cast` in release builds. Debug builds still verify the cast.
   2. Write the string column and null map directly instead of going through 
the SerDe layer (semantically identical, verified against 
`ColumnNullable::insert_data` / `DataTypeStringSerDe` implementations). The 
virtual `_deserialize_nullable_string` dispatch is kept, so TextReader's 
hive-text semantics (different escape handling and null detection) remain 
intact.
   3. Add `_reserve_nullable_string_columns`, called once per batch. It 
performs checked `assert_cast`s, so the unchecked per-row casts are backed by a 
real type validation per batch that throws on mismatch instead of invoking 
undefined behavior. It also reserves offsets/null_map capacity to avoid 
incremental PODArray growth in the row loop.
   
   The implementation is stateless: no cached column pointers, no per-batch 
member state to initialize/clear.
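
The direct-write and per-batch-reserve ideas can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in this PR: `StringColumn`, `NullableStringColumn`, `reserve_for_batch`, and `append_cell` are hypothetical names, `std::vector` stands in for PODArray, and NULL detection is reduced to an exact match against a `null_format` string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical, simplified string column: all bytes concatenated in `chars`,
// with `offsets[i]` holding the end position of row i.
struct StringColumn {
    std::vector<char> chars;
    std::vector<uint32_t> offsets;
};

// Hypothetical nullable wrapper: null_map[i] == 1 means row i is NULL.
struct NullableStringColumn {
    StringColumn nested;
    std::vector<uint8_t> null_map;
};

// Once per batch: reserve room for `rows` more entries so the row loop does
// not grow the arrays incrementally.
inline void reserve_for_batch(NullableStringColumn& col, std::size_t rows) {
    col.nested.offsets.reserve(col.nested.offsets.size() + rows);
    col.null_map.reserve(col.null_map.size() + rows);
}

// Per cell: write the bytes and the null flag directly, with no intermediate
// call layer. Assumption: a cell equal to `null_format` denotes NULL.
inline void append_cell(NullableStringColumn& col, std::string_view cell,
                        std::string_view null_format) {
    assert(col.nested.offsets.size() == col.null_map.size());
    const bool is_null = (cell == null_format);
    if (!is_null) {
        col.nested.chars.insert(col.nested.chars.end(), cell.begin(), cell.end());
    }
    // A NULL row still gets an offsets entry (an empty value) so that the
    // offsets array and the null map keep the same length.
    col.nested.offsets.push_back(static_cast<uint32_t>(col.nested.chars.size()));
    col.null_map.push_back(is_null ? 1 : 0);
}
```

Because both helpers take the column as an argument and keep nothing between calls, the sketch stays stateless in the same sense as described above.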
   
   ### Performance
   
   A/B test on full ClickBench dataset (73GB / 100M rows / 105 columns), 
identical deployment and config, only the BE binary differs:
   
   | Metric | Before | After | Improvement |
   |---|---|---|---|
   | Total load time (BE LoadTime) | 636.6s | 530.9s | -16.6% (1.20x) |
   | CSV parse (ReadDataTime) | 590.6s | 484.5s | -18.0% |
   | Avg throughput | 115 MB/s | 138 MB/s | +20% |
   
   All 10 splits (10M rows each) improved consistently by 14-18% with small 
variance. Loaded row counts are identical between the two runs (99,997,497 
rows).
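
For readers who want to re-derive the Improvement column, a small sketch of the arithmetic follows. The helper names `percent_change` and `speedup` are illustrative only and do not come from the PR.

```cpp
#include <cassert>

// Relative change of `after` versus `before`, in percent.
// A negative result means the "after" run took less time.
inline double percent_change(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

// Speedup factor: how many times faster the "after" run is.
inline double speedup(double before, double after) {
    assert(after > 0.0);
    return before / after;
}
```

Applied to the table, `percent_change(636.6, 530.9)` is about -16.6 and `speedup(636.6, 530.9)` is about 1.20, matching the total load time row.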
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test
       - [ ] Regression test
       - [ ] Unit Test
       - [x] Manual test (add detailed scripts or steps below)
           - Full ClickBench load A/B test, see Performance section above. 
Behavioral equivalence is covered by existing CSV/text load regression cases.
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason
   
   - Behavior changed:
       - [x] No.
       - [ ] Yes.
   
   - Does this need documentation?
       - [x] No.
       - [ ] Yes.
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

