github-actions[bot] commented on code in PR #63192:
URL: https://github.com/apache/doris/pull/63192#discussion_r3233406941
##########
be/src/format/parquet/vparquet_reader.cpp:
##########
@@ -432,11 +435,111 @@ Status
ParquetReader::on_before_init_reader(ReaderInitContext* ctx) {
RETURN_IF_ERROR(get_file_metadata_schema(&field_desc));
RETURN_IF_ERROR(TableSchemaChangeHelper::BuildTableInfoUtil::by_parquet_name(
ctx->tuple_descriptor, *field_desc, ctx->table_info_node));
+ auto column_id_result = _create_column_ids_by_name(field_desc,
ctx->tuple_descriptor);
+ ctx->column_ids = std::move(column_id_result.column_ids);
+ ctx->filter_column_ids = std::move(column_id_result.filter_column_ids);
}
return Status::OK();
}
+ColumnIdResult ParquetReader::_create_column_ids_by_name(const
FieldDescriptor* field_desc,
+ const
TupleDescriptor* tuple_descriptor) {
+ auto* mutable_field_desc = const_cast<FieldDescriptor*>(field_desc);
+ mutable_field_desc->assign_ids();
+
+ std::unordered_map<std::string, const FieldSchema*>
table_col_name_to_field_schema_map;
+ for (int i = 0; i < field_desc->size(); ++i) {
+ auto field_schema = field_desc->get_column(i);
+ if (!field_schema) {
+ continue;
+ }
+ table_col_name_to_field_schema_map[field_schema->lower_case_name] =
field_schema;
+ }
+
+ std::set<uint64_t> column_ids;
+ std::set<uint64_t> filter_column_ids;
+
+ auto process_access_paths = [](const FieldSchema* parquet_field,
+ const std::vector<TColumnAccessPath>&
access_paths,
+ std::set<uint64_t>& out_ids) {
+ process_nested_access_paths(
+ parquet_field, access_paths, out_ids,
+ [](const FieldSchema* field) { return field->get_column_id();
},
+ [](const FieldSchema* field) { return
field->get_max_column_id(); },
+ HiveParquetNestedColumnUtils::extract_nested_column_ids);
+ };
+
+ for (const auto* slot : tuple_descriptor->slots()) {
+ auto it =
table_col_name_to_field_schema_map.find(slot->col_name_lower_case());
+ if (it == table_col_name_to_field_schema_map.end()) {
+ continue;
+ }
+ auto field_schema = it->second;
+
Review Comment:
This branch now treats `TYPE_VARIANT` like a nested-pruned complex type, but
a full VARIANT projection has no access path: `AccessPathExpressionCollector`
only records variant paths when a subcolumn is accessed. In a query like
`SELECT id, v FROM local(...)`, the scalar `id` inserts one column id, so
`_column_ids` is non-empty; the full `v` slot has empty `all_access_paths()`,
`process_access_paths()` inserts nothing for it, and the downstream
readers/profile collection prune all `v` leaves. Please add the full variant
field range when `all_access_paths()` is empty (or make FE emit an explicit
root access path) and add coverage for selecting the full VARIANT column
together with another column. This is distinct from the existing shredded
subpath threads because it affects full-slot projection rather than
residual/typed-value reconstruction.
##########
be/src/format/table/hive_reader.cpp:
##########
@@ -328,7 +328,7 @@ ColumnIdResult HiveParquetReader::_create_column_ids(const
FieldDescriptor* fiel
// primitive (non-nested) types
if ((slot->col_type() != TYPE_STRUCT && slot->col_type() != TYPE_ARRAY
&&
Review Comment:
Same full-slot pruning gap in the Hive Parquet path: after adding
`TYPE_VARIANT` to the complex branch, a selected VARIANT column with no subpath
contributes no column ids. If another projected/predicate column makes
`column_ids` non-empty, the Hive Parquet reader will skip the entire VARIANT
payload for queries such as `SELECT id, v FROM hive_table`. Please select the
full field range when `all_access_paths()` is empty, and apply the same fix to
the top-level-column-index helper below. This is a separate reader path from
the standalone local Parquet path.
##########
be/src/format/table/iceberg_reader.cpp:
##########
@@ -344,7 +344,7 @@ ColumnIdResult
IcebergParquetReader::_create_column_ids(const FieldDescriptor* f
auto field_schema = it->second;
if ((slot->col_type() != TYPE_STRUCT && slot->col_type() != TYPE_ARRAY
&&
- slot->col_type() != TYPE_MAP)) {
+ slot->col_type() != TYPE_MAP && slot->col_type() !=
TYPE_VARIANT)) {
Review Comment:
The Iceberg path has the same full VARIANT projection issue. A full slot
read has empty `all_access_paths()`, so once `TYPE_VARIANT` bypasses the
primitive branch it adds no ids for `v`; with any other selected Iceberg
column, `column_ids` is non-empty and all VARIANT leaves are pruned. Please add
the full field range for empty access paths (or emit a root access path from
FE) and cover `SELECT id, v` on an Iceberg VARIANT file. This is distinct from
the existing Iceberg subpath/residual pruning threads because no VARIANT
subpath is requested here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]