github-actions[bot] commented on code in PR #63001:
URL: https://github.com/apache/doris/pull/63001#discussion_r3255091712
##########
be/src/core/column/column_nullable.cpp:
##########
@@ -113,7 +164,14 @@ void ColumnNullable::update_crc32c_batch(uint32_t*
__restrict hashes,
const auto* __restrict real_null_data =
get_null_map_column().get_data().data();
if (_nested_column->support_replace_column_null_data()) {
// nullmap process is slow, replace null data to default value to
avoid nullmap process
-
_nested_column->assume_mutable()->replace_column_null_data(real_null_data);
+ // This is an intentional in-place mutation inside a logically-const
hash computation:
+ // null positions are overwritten with defaults so the inner hash loop
needs no null checks.
+ // The invariant is that a column instance is not hashed concurrently
through the same
+ // owner while this per-block hash path runs. Shared aliases are
detached by mutate()
+ // before this normalized nested column is written back.
+ auto nested_mut = std::move(*static_cast<const
IColumn::Ptr&>(_nested_column)).mutate();
+ nested_mut->replace_column_null_data(real_null_data);
+ static_cast<IColumn::Ptr&>(const_cast<WrappedPtr&>(_nested_column)) =
std::move(nested_mut);
_nested_column->update_crc32c_batch(hashes, nullptr);
} else {
Review Comment:
This still violates top-level COW semantics for shared nullable columns.
`update_crc32c_batch()` is `const`, but it takes the nested pointer from the
shared parent and calls the rvalue `mutate()` without first detaching the
`ColumnNullable` object itself. If two `ColumnPtr`s share the same
`ColumnNullable`, the nested column's refcount can still be 1 because it is
only owned by that shared parent object, so this path mutates the nested data
in place and then writes it back through `const_cast`, making the normalization
visible through all aliases of the nullable column. Hashing one shared block
can therefore change another alias's null-position payloads. Please avoid
mutating column storage from this const hash path, or require callers to pass
an already detached mutable nullable owner before replacing null data.
##########
be/src/format/orc/vorc_reader.cpp:
##########
@@ -2172,45 +2210,64 @@ Status OrcReader::_orc_column_to_doris_column(
resolved_column = converter->get_column(src_type, doris_column,
data_type);
resolved_type = converter->get_type();
- if (resolved_column->is_nullable()) {
+ MutableColumnPtr mutable_resolved_column;
+ if (converter->is_consistent()) {
+ resolved_column.reset();
+ mutable_resolved_column = IColumn::mutate(std::move(doris_column));
+ } else {
+ mutable_resolved_column =
IColumn::mutate(std::move(resolved_column));
+ }
+
+ size_t src_null_map_start = 0;
+ if (mutable_resolved_column->is_nullable()) {
SCOPED_RAW_TIMER(&_statistics.decode_null_map_time);
auto* nullable_column =
-
reinterpret_cast<ColumnNullable*>(resolved_column->assume_mutable().get());
+
reinterpret_cast<ColumnNullable*>(mutable_resolved_column.get());
data_column = nullable_column->get_nested_column_ptr();
-
- NullMap& map_data_column = nullable_column->get_null_map_data();
- auto origin_size = map_data_column.size();
- map_data_column.resize(origin_size + num_values);
- if (cvb->hasNulls) {
- const auto* cvb_nulls = cvb->notNull.data();
- for (int i = 0; i < num_values; ++i) {
- map_data_column[origin_size + i] = !cvb_nulls[i];
- }
- } else {
- memset(map_data_column.data() + origin_size, 0, num_values);
- }
+ src_null_map_start = nullable_column->get_null_map_column().size();
+ fill_orc_null_map(nullable_column, cvb, num_values);
} else {
if (cvb->hasNulls) {
return Status::InternalError("Not nullable column {} has null
values in orc file",
col_name);
}
- data_column = resolved_column->assume_mutable();
+ data_column = std::move(mutable_resolved_column);
}
RETURN_IF_ERROR(_fill_doris_data_column<is_filter>(
col_name, data_column, remove_nullable(resolved_type),
root_node, orc_column_type,
cvb, num_values));
- // resolve schema change
+
Review Comment:
When `converter->is_consistent()`, line 2216 has already moved
`doris_column` into `mutable_resolved_column`, but `_fill_doris_data_column()`
can fail here before line 2249 restores it. A corrupt/unsupported nested ORC
value would then return an error with the caller's column pointer left
moved-from. This is separate from the existing `Block::mutate_columns()` review
thread because the move/restore gap is local to ORC column materialization.
Please restore the column on error or defer moving out of `doris_column` until
after the failing work is complete.
##########
be/src/format/parquet/parquet_column_convert.h:
##########
@@ -254,26 +335,46 @@ class PhysicalToLogicalConverter {
PrimitiveType::TYPE_INT, dst_logical_type->is_nullable());
}
if (is_consistent() && _logical_converter->is_consistent()) {
+ dst_logical_col = std::move(src_physical_col);
+ return Status::OK();
+ }
+ if (_logical_converter->is_consistent()) {
+ const size_t old_rows =
get_mutable_inner_column_size(dst_logical_col);
+ const size_t old_null_map_size =
+ get_null_map_size_or_inner_column_size(dst_logical_col);
+ RETURN_IF_ERROR(physical_convert(src_physical_col,
dst_logical_col));
+ const size_t new_rows =
get_mutable_inner_column_size(dst_logical_col) - old_rows;
+ align_null_map(src_physical_col, dst_logical_col,
old_null_map_size, new_rows,
+ get_appended_null_map_start(src_physical_col,
new_rows));
return Status::OK();
}
+
ColumnPtr src_logical_column;
if (is_consistent()) {
- if (dst_logical_type->is_nullable()) {
- auto doris_nullable_column =
- assert_cast<const
ColumnNullable*>(dst_logical_col.get());
- src_logical_column =
- ColumnNullable::create(_cached_src_physical_column,
-
doris_nullable_column->get_null_map_column_ptr());
- } else {
- src_logical_column = _cached_src_physical_column;
- }
+ src_logical_column = src_physical_col;
} else {
src_logical_column =
_logical_converter->get_column(src_logical_type, dst_logical_col,
dst_logical_type);
}
+ const size_t src_old_rows =
get_mutable_inner_column_size(src_logical_column);
+ const size_t src_old_null_map_size =
+ get_null_map_size_or_inner_column_size(src_logical_column);
RETURN_IF_ERROR(physical_convert(src_physical_col,
src_logical_column));
+ const size_t src_new_rows =
+ get_mutable_inner_column_size(src_logical_column) -
src_old_rows;
+ align_null_map(src_physical_col, src_logical_column,
src_old_null_map_size, src_new_rows,
+ get_appended_null_map_start(src_physical_col,
src_new_rows));
+
+ dst_logical_col = IColumn::mutate(std::move(dst_logical_col));
+ const size_t dst_old_rows =
get_mutable_inner_column_size(dst_logical_col);
+ const size_t dst_old_null_map_size =
+ get_null_map_size_or_inner_column_size(dst_logical_col);
auto converted_column = dst_logical_col->assume_mutable();
- return _logical_converter->convert(src_logical_column,
converted_column);
+ RETURN_IF_ERROR(_logical_converter->convert(src_logical_column,
converted_column));
+ const size_t dst_new_rows =
get_mutable_inner_column_size(dst_logical_col) - dst_old_rows;
+ align_null_map(src_logical_column, dst_logical_col,
dst_old_null_map_size, dst_new_rows,
+ get_appended_null_map_start(src_logical_column,
dst_new_rows));
+ return Status::OK();
Review Comment:
The destination null map is aligned only after
`_logical_converter->convert()`, but some logical converters write into the
destination null map during conversion. For example, string-to-number/date
conversion marks failed casts with `(*null_map)[start_idx + i] = 1`; at this
point `dst_logical_col` has only the old null-map rows, while the nested data
column has just been resized/appended. A nullable Parquet schema-change cast
with a bad string can therefore write past the current null map or fail to mark
the appended row as null. ORC aligns the destination null map before calling
`convert()`; this path should do the same (or otherwise resize the nullable
destination before converters can write null flags).
##########
be/src/exec/scan/meta_scanner.cpp:
##########
@@ -112,21 +112,14 @@ Status MetaScanner::_get_block_impl(RuntimeState* state,
Block* block, bool* eof
columns.resize(column_size);
for (auto i = 0; i < column_size; i++) {
if (mem_reuse) {
- columns[i] =
block->get_by_position(i).column->assume_mutable();
+ columns[i] =
IColumn::mutate(std::move(block->get_by_position(i).column));
} else {
columns[i] =
_tuple_desc->slots()[i]->get_empty_mutable_column();
}
}
Review Comment:
This is another distinct caller left unsafe by the new stealing
`IColumn::mutate(std::move(...))` pattern: with `mem_reuse`, the block's
columns are moved out before `_fill_block_with_remote_data(columns)`, and the
`RETURN_IF_ERROR` below skips `block->set_columns(...)`.
`_fill_block_with_remote_data()` has a real failure path for unsupported slot
types, so the caller can receive an error with the reusable output block
containing null/moved-from columns. Please use `mutate_columns_scoped()`/RAII
or otherwise restore the columns before returning errors.
##########
be/src/format/parquet/vparquet_column_reader.cpp:
##########
@@ -550,6 +548,10 @@ Status ScalarColumnReader<IN_COLLECTION,
OFFSET_INDEX>::read_column_data(
ColumnPtr resolved_column =
_converter->get_physical_column(_field_schema->physical_type,
_field_schema->data_type,
doris_column, type,
is_dict_filter);
+ if (_converter->read_directly_into_dst_logical_column()) {
+ DCHECK_EQ(resolved_column.get(), doris_column.get());
+ resolved_column = std::move(doris_column);
+ }
DataTypePtr& resolved_type = _converter->get_physical_type();
_def_levels.clear();
Review Comment:
In the direct-read path this moves the caller's `doris_column` into
`resolved_column`, but `_read_nested_column()` below can return an error before
`_converter->convert()` moves it back. That leaves the caller-owned `ColumnPtr`
empty on a non-OK return, which is a new moved-from error path distinct from
the existing `Block::mutate_columns()` callers: it happens inside Parquet
scalar nested reading itself. Please restore `doris_column` on all exits (for
example with a scoped restore) or avoid moving it out before error-producing
work completes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]