github-actions[bot] commented on code in PR #60920:
URL: https://github.com/apache/doris/pull/60920#discussion_r2919355604
##########
be/src/format/csv/csv_reader.cpp:
##########
@@ -662,11 +690,35 @@ Status CsvReader::_fill_dest_columns(const Slice& line,
Block* block,
}
if (_use_nullable_string_opt[i]) {
- // For load task, we always read "string" from file.
- // So serdes[i] here must be DataTypeNullableSerDe, and
DataTypeNullableSerDe -> nested_serde must be DataTypeStringSerDe.
- // So we use deserialize_nullable_string and stringSerDe to reduce
virtual function calls.
- RETURN_IF_ERROR(_deserialize_nullable_string(*col_ptr, value));
+ // Inline fast path: bypass StringSerDe and per-row assert_cast
entirely.
+ DCHECK_LT(i, _nullable_str_col_cache.size());
+ auto& cache = _nullable_str_col_cache[i];
+ if (_empty_field_as_null && value.size == 0) {
+ cache.nested_str_col->insert_default();
+ cache.null_map->push_back(1);
+ continue;
+ }
+ if (_options.null_len > 0 &&
+ !(_options.converted_from_string &&
value.trim_double_quotes())) {
+ if (value.compare(Slice(_options.null_format,
_options.null_len)) == 0) {
+ cache.nested_str_col->insert_default();
+ cache.null_map->push_back(1);
+ continue;
+ }
+ }
+ if (UNLIKELY(_has_escape_char)) {
+ escape_string_for_csv(value.data, &value.size,
_options.escape_char,
+ _options.quote_char);
+ }
+ cache.nested_str_col->insert_data(value.data, value.size);
+ cache.null_map->push_back(0);
} else {
+ // Non-optimized path: col_ptr only needed here since fast path
uses cached pointers
+ IColumn* col_ptr = columns[i].get();
Review Comment:
**Issue 1 (Performance/Code Quality): Dead `col_ptr` computation + variable
shadowing**
The original `col_ptr` assignment at lines 684-690 (outside this hunk,
between hunk 2 and hunk 3 in the diff) was NOT removed by this PR. It is still
computed on **every row for every column**, but its value is never used:
- The fast path (lines 692-714) uses cached pointers from
`_nullable_str_col_cache` instead.
- This `else` block declares a **new** local `col_ptr` (shadowing the outer
one) and recomputes it.
This means every row pays the cost of computing `col_ptr` (including the
`block->get_by_position()` call in the `!_is_load` path) even though it's dead
code. For a performance optimization PR, this is counterproductive.
**Fix:** Remove the `col_ptr` assignment block at lines 684-690 entirely
(the block currently between `auto value = ...` and `if
(_use_nullable_string_opt[i])`). The `else` branch here already computes its
own `col_ptr`.
##########
be/src/format/csv/csv_reader.cpp:
##########
@@ -640,6 +641,33 @@ Status CsvReader::_fill_dest_columns(const Slice& line,
Block* block,
std::vector<MutableColumnPtr>& columns,
size_t* rows) {
bool is_success = false;
+ // Initialize cached column pointers on first call per batch to avoid
per-row assert_cast.
+ // Also pre-reserve capacity for offsets/chars/null_map to eliminate
realloc during row loop.
+ if (UNLIKELY(_nullable_str_col_cache.empty())) {
+ _nullable_str_col_cache.resize(_file_slot_descs.size());
+ _has_escape_char = (_options.escape_char != 0);
+ const size_t batch_size = std::max(_state->batch_size(),
(int)_MIN_BATCH_SIZE);
Review Comment:
**Nit: C-style cast on `_MIN_BATCH_SIZE`**
```cpp
const size_t batch_size = std::max(_state->batch_size(),
(int)_MIN_BATCH_SIZE);
```
`_MIN_BATCH_SIZE` is `const size_t` (defined in `generic_reader.h`). The
C-style `(int)` cast is a narrowing conversion from `size_t` to `int`. While
4064 fits in `int`, this pattern is fragile and would trigger warnings under
`-Wconversion` (which `compile_check_begin.h` elevates to errors).
Consider:
```cpp
const size_t batch_size =
std::max(static_cast<size_t>(_state->batch_size()), _MIN_BATCH_SIZE);
```
Or if you want to guard against negative `batch_size()` returns:
```cpp
const size_t batch_size = std::max<size_t>(_state->batch_size(),
_MIN_BATCH_SIZE);
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]