github-actions[bot] commented on code in PR #61888:
URL: https://github.com/apache/doris/pull/61888#discussion_r3013400746
##########
be/src/storage/segment/column_reader.cpp:
##########
@@ -1903,6 +2171,68 @@ Status FileColumnIterator::next_batch(size_t* n,
MutableColumnPtr& dst, bool* ha
Status FileColumnIterator::read_by_rowids(const rowid_t* rowids, const size_t
count,
MutableColumnPtr& dst) {
+ if (read_null_map_only()) {
+ DLOG(INFO) << "File column iterator column " << _column_name
+ << " in NULL_MAP_ONLY mode, reading only null map by
rowids.";
+
+ DORIS_CHECK(dst->is_nullable());
+ auto& nullable_col = assert_cast<ColumnNullable&>(*dst);
+ auto& null_map_data = nullable_col.get_null_map_data();
+ const size_t base_size = null_map_data.size();
+ null_map_data.resize(base_size + count);
+
Review Comment:
**[Medium Bug — Uninitialized Memory]** Same `PODArray::resize()` issue as
in `next_batch`: `null_map_data.resize(base_size + count)` allocates `count`
new entries with uninitialized memory.
Unlike `next_batch`, this `read_by_rowids` path does subsequently fill the
null map via `memset` in the loop below (lines 2215-2228). However, if the
while loop terminates early (e.g., all remaining rows are processed but some
entries weren't reached), entries could remain uninitialized.
Suggestion: Use `resize_fill(base_size + count)` or `memset` the entire
range to 0 after resize, then let the loop overwrite with actual values.
##########
be/src/storage/segment/column_reader.cpp:
##########
@@ -1838,6 +2094,18 @@ Status
FileColumnIterator::next_batch_of_zone_map(size_t* n, MutableColumnPtr& d
}
Status FileColumnIterator::next_batch(size_t* n, MutableColumnPtr& dst, bool*
has_null) {
+ if (read_null_map_only()) {
+ DLOG(INFO) << "File column iterator column " << _column_name
+ << " in NULL_MAP_ONLY mode, reading only null map.";
+ DORIS_CHECK(dst->is_nullable());
+ auto& nullable_col = assert_cast<ColumnNullable&>(*dst);
+ nullable_col.get_nested_column().insert_many_defaults(*n);
+ auto& null_map_data = nullable_col.get_null_map_data();
+ null_map_data.resize(null_map_data.size() + *n);
+ *has_null = true;
Review Comment:
**[Critical Bug — Data Correctness]** `PODArray::resize()` does NOT
zero-initialize new elements (see `pod_array.h:68`: *"It differs from
std::vector in that it does not initialize the elements"*). The new null map
entries will contain **uninitialized memory (garbage)**, causing random rows to
be incorrectly classified as null or non-null.
But fixing this alone is insufficient — this entire NULL_MAP_ONLY block
never reads the actual null bitmap from the segment pages. The normal path
(lines 2119-2162) iterates through pages, decodes the null bitmap via
`_page.null_decoder.GetNextRun()`, and processes each run. This path skips all
of that.
For scalar `FileColumnIterator`, the null bitmap is embedded in the data
pages (unlike complex types which have a separate `_null_iterator`). To
correctly implement NULL_MAP_ONLY here, you need to:
1. Iterate through pages (like the normal path does)
2. Read `_page.null_decoder` to get actual null/non-null status
3. Write the correct values into `null_map_data`
4. Skip `_page.data_decoder->next_batch()` calls (since we only want nulls)
Alternatively, if this mode is only used for complex type sub-iterators that
have their own `_null_iterator`, consider whether `FileColumnIterator` actually
needs NULL_MAP_ONLY support at all — the Map/Array/Struct iterators already
handle it via their own null iterators.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]