github-actions[bot] commented on code in PR #63001:
URL: https://github.com/apache/doris/pull/63001#discussion_r3255091712


##########
be/src/core/column/column_nullable.cpp:
##########
@@ -113,7 +164,14 @@ void ColumnNullable::update_crc32c_batch(uint32_t* 
__restrict hashes,
     const auto* __restrict real_null_data = 
get_null_map_column().get_data().data();
     if (_nested_column->support_replace_column_null_data()) {
         // nullmap process is slow, replace null data to default value to 
avoid nullmap process
-        
_nested_column->assume_mutable()->replace_column_null_data(real_null_data);
+        // This is an intentional in-place mutation inside a logically-const 
hash computation:
+        // null positions are overwritten with defaults so the inner hash loop 
needs no null checks.
+        // The invariant is that a column instance is not hashed concurrently 
through the same
+        // owner while this per-block hash path runs. Shared aliases are 
detached by mutate()
+        // before this normalized nested column is written back.
+        auto nested_mut = std::move(*static_cast<const 
IColumn::Ptr&>(_nested_column)).mutate();
+        nested_mut->replace_column_null_data(real_null_data);
+        static_cast<IColumn::Ptr&>(const_cast<WrappedPtr&>(_nested_column)) = 
std::move(nested_mut);
         _nested_column->update_crc32c_batch(hashes, nullptr);
     } else {

Review Comment:
   This still violates top-level COW semantics for shared nullable columns. 
`update_crc32c_batch()` is `const`, but it takes the nested pointer from the 
shared parent and calls the rvalue `mutate()` without first detaching the 
`ColumnNullable` object itself. If two `ColumnPtr`s share the same 
`ColumnNullable`, the nested column's refcount can still be 1 because it is 
only owned by that shared parent object, so this path mutates the nested data 
in place and then writes it back through `const_cast`, making the normalization 
visible through all aliases of the nullable column. Hashing one shared block 
can therefore change another alias's null-position payloads. Please avoid 
mutating column storage from this const hash path, or require callers to pass 
an already detached mutable nullable owner before replacing null data.



##########
be/src/format/orc/vorc_reader.cpp:
##########
@@ -2172,45 +2210,64 @@ Status OrcReader::_orc_column_to_doris_column(
         resolved_column = converter->get_column(src_type, doris_column, 
data_type);
         resolved_type = converter->get_type();
 
-        if (resolved_column->is_nullable()) {
+        MutableColumnPtr mutable_resolved_column;
+        if (converter->is_consistent()) {
+            resolved_column.reset();
+            mutable_resolved_column = IColumn::mutate(std::move(doris_column));
+        } else {
+            mutable_resolved_column = 
IColumn::mutate(std::move(resolved_column));
+        }
+
+        size_t src_null_map_start = 0;
+        if (mutable_resolved_column->is_nullable()) {
             SCOPED_RAW_TIMER(&_statistics.decode_null_map_time);
             auto* nullable_column =
-                    
reinterpret_cast<ColumnNullable*>(resolved_column->assume_mutable().get());
+                    
reinterpret_cast<ColumnNullable*>(mutable_resolved_column.get());
             data_column = nullable_column->get_nested_column_ptr();
-
-            NullMap& map_data_column = nullable_column->get_null_map_data();
-            auto origin_size = map_data_column.size();
-            map_data_column.resize(origin_size + num_values);
-            if (cvb->hasNulls) {
-                const auto* cvb_nulls = cvb->notNull.data();
-                for (int i = 0; i < num_values; ++i) {
-                    map_data_column[origin_size + i] = !cvb_nulls[i];
-                }
-            } else {
-                memset(map_data_column.data() + origin_size, 0, num_values);
-            }
+            src_null_map_start = nullable_column->get_null_map_column().size();
+            fill_orc_null_map(nullable_column, cvb, num_values);
         } else {
             if (cvb->hasNulls) {
                 return Status::InternalError("Not nullable column {} has null 
values in orc file",
                                              col_name);
             }
-            data_column = resolved_column->assume_mutable();
+            data_column = std::move(mutable_resolved_column);
         }
 
         RETURN_IF_ERROR(_fill_doris_data_column<is_filter>(
                 col_name, data_column, remove_nullable(resolved_type), 
root_node, orc_column_type,
                 cvb, num_values));
-        // resolve schema change
+

Review Comment:
   When `converter->is_consistent()`, line 2216 has already moved 
`doris_column` into `mutable_resolved_column`, but `_fill_doris_data_column()` 
can fail here before line 2249 restores it. A corrupt/unsupported nested ORC 
value would then return an error with the caller's column pointer left 
moved-from. This is separate from the existing `Block::mutate_columns()` review 
thread because the move/restore gap is local to ORC column materialization. 
Please restore the column on error or defer moving out of `doris_column` until 
after the failing work is complete.



##########
be/src/format/parquet/parquet_column_convert.h:
##########
@@ -254,26 +335,46 @@ class PhysicalToLogicalConverter {
                     PrimitiveType::TYPE_INT, dst_logical_type->is_nullable());
         }
         if (is_consistent() && _logical_converter->is_consistent()) {
+            dst_logical_col = std::move(src_physical_col);
+            return Status::OK();
+        }
+        if (_logical_converter->is_consistent()) {
+            const size_t old_rows = 
get_mutable_inner_column_size(dst_logical_col);
+            const size_t old_null_map_size =
+                    get_null_map_size_or_inner_column_size(dst_logical_col);
+            RETURN_IF_ERROR(physical_convert(src_physical_col, 
dst_logical_col));
+            const size_t new_rows = 
get_mutable_inner_column_size(dst_logical_col) - old_rows;
+            align_null_map(src_physical_col, dst_logical_col, 
old_null_map_size, new_rows,
+                           get_appended_null_map_start(src_physical_col, 
new_rows));
             return Status::OK();
         }
+
         ColumnPtr src_logical_column;
         if (is_consistent()) {
-            if (dst_logical_type->is_nullable()) {
-                auto doris_nullable_column =
-                        assert_cast<const 
ColumnNullable*>(dst_logical_col.get());
-                src_logical_column =
-                        ColumnNullable::create(_cached_src_physical_column,
-                                               
doris_nullable_column->get_null_map_column_ptr());
-            } else {
-                src_logical_column = _cached_src_physical_column;
-            }
+            src_logical_column = src_physical_col;
         } else {
             src_logical_column = 
_logical_converter->get_column(src_logical_type, dst_logical_col,
                                                                 
dst_logical_type);
         }
+        const size_t src_old_rows = 
get_mutable_inner_column_size(src_logical_column);
+        const size_t src_old_null_map_size =
+                get_null_map_size_or_inner_column_size(src_logical_column);
         RETURN_IF_ERROR(physical_convert(src_physical_col, 
src_logical_column));
+        const size_t src_new_rows =
+                get_mutable_inner_column_size(src_logical_column) - 
src_old_rows;
+        align_null_map(src_physical_col, src_logical_column, 
src_old_null_map_size, src_new_rows,
+                       get_appended_null_map_start(src_physical_col, 
src_new_rows));
+
+        dst_logical_col = IColumn::mutate(std::move(dst_logical_col));
+        const size_t dst_old_rows = 
get_mutable_inner_column_size(dst_logical_col);
+        const size_t dst_old_null_map_size =
+                get_null_map_size_or_inner_column_size(dst_logical_col);
         auto converted_column = dst_logical_col->assume_mutable();
-        return _logical_converter->convert(src_logical_column, 
converted_column);
+        RETURN_IF_ERROR(_logical_converter->convert(src_logical_column, 
converted_column));
+        const size_t dst_new_rows = 
get_mutable_inner_column_size(dst_logical_col) - dst_old_rows;
+        align_null_map(src_logical_column, dst_logical_col, 
dst_old_null_map_size, dst_new_rows,
+                       get_appended_null_map_start(src_logical_column, 
dst_new_rows));
+        return Status::OK();

Review Comment:
   The destination null map is aligned only after 
`_logical_converter->convert()`, but some logical converters write into the 
destination null map during conversion. For example, string-to-number/date 
conversion marks failed casts with `(*null_map)[start_idx + i] = 1`; at this 
point `dst_logical_col` has only the old null-map rows, while the nested data 
column has just been resized/appended. A nullable Parquet schema-change cast 
with a bad string can therefore write past the current null map or fail to mark 
the appended row as null. ORC aligns the destination null map before calling 
`convert()`; this path should do the same (or otherwise resize the nullable 
destination before converters can write null flags).



##########
be/src/exec/scan/meta_scanner.cpp:
##########
@@ -112,21 +112,14 @@ Status MetaScanner::_get_block_impl(RuntimeState* state, 
Block* block, bool* eof
         columns.resize(column_size);
         for (auto i = 0; i < column_size; i++) {
             if (mem_reuse) {
-                columns[i] = 
block->get_by_position(i).column->assume_mutable();
+                columns[i] = 
IColumn::mutate(std::move(block->get_by_position(i).column));
             } else {
                 columns[i] = 
_tuple_desc->slots()[i]->get_empty_mutable_column();
             }
         }

Review Comment:
   This is another distinct caller left unsafe by the new stealing 
`IColumn::mutate(std::move(...))` pattern: with `mem_reuse`, the block's 
columns are moved out before `_fill_block_with_remote_data(columns)`, and the 
`RETURN_IF_ERROR` below skips `block->set_columns(...)`. 
`_fill_block_with_remote_data()` has a real failure path for unsupported slot 
types, so the caller can receive an error with the reusable output block 
containing null/moved-from columns. Please use `mutate_columns_scoped()`/RAII 
or otherwise restore the columns before returning errors.



##########
be/src/format/parquet/vparquet_column_reader.cpp:
##########
@@ -550,6 +548,10 @@ Status ScalarColumnReader<IN_COLLECTION, 
OFFSET_INDEX>::read_column_data(
     ColumnPtr resolved_column =
             _converter->get_physical_column(_field_schema->physical_type, 
_field_schema->data_type,
                                             doris_column, type, 
is_dict_filter);
+    if (_converter->read_directly_into_dst_logical_column()) {
+        DCHECK_EQ(resolved_column.get(), doris_column.get());
+        resolved_column = std::move(doris_column);
+    }
     DataTypePtr& resolved_type = _converter->get_physical_type();
 
     _def_levels.clear();

Review Comment:
   In the direct-read path this moves the caller's `doris_column` into 
`resolved_column`, but `_read_nested_column()` below can return an error before 
`_converter->convert()` moves it back. That leaves the caller-owned `ColumnPtr` 
empty on a non-OK return, which is a new moved-from error path distinct from 
the existing `Block::mutate_columns()` callers: it happens inside Parquet 
scalar nested reading itself. Please restore `doris_column` on all exits (for 
example with a scoped restore) or avoid moving it out before error-producing 
work completes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to