liaoxin01 opened a new pull request, #64476:
URL: https://github.com/apache/doris/pull/64476
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #60920 (previous attempt, superseded by this stateless
implementation)
Problem Summary:
When loading CSV data, every column is read as a nullable string, so
`_deserialize_nullable_string` is the per-row, per-column hot path (ClickBench:
105 columns x 100M rows = ~10.5 billion cells). A flame graph shows two major
per-cell overheads:
1. `assert_cast<ColumnNullable&>` performs a typeid comparison per cell in
release builds.
2. `DataTypeStringSerDe::deserialize_one_cell_from_csv` adds a call layer
with another per-cell `assert_cast<ColumnString&>` inside, plus Status
plumbing. Its fill-null-on-failure branch is dead code since the method never
fails.
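To illustrate the first overhead, here is a minimal sketch of a checked downcast with a compile-time switch. The names `sketch_cast`, `TypeCheck`, and the toy column types are hypothetical stand-ins written for this example; this is not the Doris `assert_cast` source, only an approximation of the behavior described above (a `typeid` comparison per call unless the check is disabled, with debug builds always verifying).

```cpp
#include <cassert>
#include <stdexcept>
#include <typeinfo>

// Hypothetical stand-ins for illustration; not the Doris definitions.
enum class TypeCheck { ENABLE, DISABLE };

struct IColumn { virtual ~IColumn() = default; };
struct ColumnString : IColumn {};
struct ColumnInt : IColumn {};  // a second type, used only to show a mismatch

template <typename To, TypeCheck check = TypeCheck::ENABLE>
To& sketch_cast(IColumn& from) {
#ifdef NDEBUG
    // Release build: verify only when the caller asked for it.
    constexpr bool verify = (check == TypeCheck::ENABLE);
#else
    // Debug build: always verify, even if the caller disabled the check.
    constexpr bool verify = true;
#endif
    if constexpr (verify) {
        // This comparison is the per-call cost that shows up per cell.
        if (typeid(from) != typeid(To)) {
            throw std::runtime_error("sketch_cast: unexpected column type");
        }
    }
    return static_cast<To&>(from);
}
```

With `TypeCheck::DISABLE` and `NDEBUG` defined, the function body reduces to the `static_cast`, so a call in a per-cell loop carries no type comparison.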
### Changes
1. Use `assert_cast<..., TypeCheckOnRelease::DISABLE>` in
`CsvReader::_deserialize_nullable_string` and
`TextReader::_deserialize_nullable_string`, which compiles to a plain
`static_cast` in release builds. Debug builds still verify the cast.
2. Write the string column and null map directly instead of going through
the SerDe layer (semantically identical, verified against
`ColumnNullable::insert_data` / `DataTypeStringSerDe` implementations). The
virtual `_deserialize_nullable_string` dispatch is kept, so TextReader's
hive-text semantics (different escape handling and null detection) remain
intact.
3. Add `_reserve_nullable_string_columns`, called once per batch: it
performs checked `assert_cast`s (backing the unchecked per-row casts with a
real type validation per batch, throwing instead of UB on mismatch) and
reserves offsets/null_map capacity to avoid incremental PODArray growth in the
row loop.
The implementation is stateless: no cached column pointers, no per-batch
member state to initialize/clear.
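The overall shape of these changes can be sketched as follows. This is a simplified model under stated assumptions, not the code in this PR: the toy `ColumnString` / `ColumnNullable` types use `std::vector` instead of PODArray, the free functions and the `null_token` parameter are hypothetical, `dynamic_cast` stands in for a checked `assert_cast`, and escape handling is omitted entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

struct IColumn { virtual ~IColumn() = default; };

// Toy string column: concatenated bytes plus the end offset of each row.
struct ColumnString : IColumn {
    std::vector<char> chars;
    std::vector<uint32_t> offsets;
};

// Toy nullable column: a nested column plus one null flag per row.
struct ColumnNullable : IColumn {
    std::unique_ptr<IColumn> nested = std::make_unique<ColumnString>();
    std::vector<uint8_t> null_map;
};

// Called once per batch: checked casts (throw on mismatch) and reservation.
void reserve_nullable_string_column(IColumn& col, std::size_t batch_rows) {
    auto* nullable = dynamic_cast<ColumnNullable*>(&col);
    if (nullable == nullptr) throw std::runtime_error("expected ColumnNullable");
    auto* str = dynamic_cast<ColumnString*>(nullable->nested.get());
    if (str == nullptr) throw std::runtime_error("expected nested ColumnString");
    str->offsets.reserve(str->offsets.size() + batch_rows);
    nullable->null_map.reserve(nullable->null_map.size() + batch_rows);
}

// Called per cell: unchecked casts, then direct writes to both columns.
// Safe only because the per-batch function above validated the types.
void deserialize_nullable_string(IColumn& col, std::string_view cell,
                                 std::string_view null_token) {
    auto& nullable = static_cast<ColumnNullable&>(col);
    auto& str = static_cast<ColumnString&>(*nullable.nested);
    const bool is_null = (cell == null_token);
    if (!is_null) {
        str.chars.insert(str.chars.end(), cell.begin(), cell.end());
    }
    // A null row still gets an offset entry, so row counts stay aligned.
    str.offsets.push_back(static_cast<uint32_t>(str.chars.size()));
    nullable.null_map.push_back(is_null ? 1 : 0);
}
```

In this model nothing is cached between calls: the per-cell function re-derives both references from its argument each time, which is cheap because the casts are unchecked.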
### Performance
A/B test on full ClickBench dataset (73GB / 100M rows / 105 columns),
identical deployment and config, only the BE binary differs:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Total load time (BE LoadTime) | 636.6s | 530.9s | -16.6% (1.20x) |
| CSV parse (ReadDataTime) | 590.6s | 484.5s | -18.0% |
| Avg throughput | 115 MB/s | 138 MB/s | +20% |
All 10 splits (10M rows each) improved consistently by 14-18% with small
variance. Loaded row counts are identical between the two runs (99,997,497
rows).
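For readers who want to re-derive the Improvement column, the arithmetic is sketched below. The helper names are hypothetical; the inputs are the Before/After values from the table.

```cpp
#include <cassert>
#include <cmath>

// Relative change in percent; negative means the "after" value is smaller.
double relative_change_pct(double before, double after) {
    return (after - before) / before * 100.0;
}

// Speedup factor for a time measurement (before / after).
double speedup(double before, double after) {
    return before / after;
}
```

For example, `relative_change_pct(636.6, 530.9)` is about -16.6 and `speedup(636.6, 530.9)` is about 1.20, matching the total load time row.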
### Release note
None
### Check List (For Author)
- Test
    - [ ] Regression test
    - [ ] Unit Test
    - [x] Manual test (add detailed scripts or steps below)
        - Full ClickBench load A/B test; see the Performance section above.
          Behavioral equivalence is covered by existing CSV/text load regression cases.
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason
- Behavior changed:
    - [x] No.
    - [ ] Yes.
- Does this need documentation?
    - [x] No.
    - [ ] Yes.
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label