zclllyybb commented on issue #63887: URL: https://github.com/apache/doris/issues/63887#issuecomment-4570416436
Initial maintainer triage for `apache/doris#63887`.

I checked the live issue metadata and current `upstream/master` (`6e27f117471f481e13cebabe0454dddc60e5245c`). The issue currently has no labels, and the title has already been updated to the concrete Parquet crash path.

The report is credible and actionable.

Evidence from the current code:

- `FilterMap::init()` in `be/src/format/parquet/parquet_common.cpp` sets `_has_filter=true` when `filter_all=true`. If the caller passes `nullptr`, `filter_map_data()` remains `nullptr`.
- `RowGroupReader::_rebuild_filter_map()` in `be/src/format/parquet/vparquet_group_reader.cpp` has a direct `filter_map.init(nullptr, total_rows, true)` path when rebuilding an all-filtered lazy-read filter map.
- `ScalarColumnReader::_read_nested_column()` in `be/src/format/parquet/vparquet_column_reader.cpp` calls `gen_filter_map()` whenever `filter_map.has_filter()` is true.
- `ScalarColumnReader::gen_filter_map()` in `be/src/format/parquet/vparquet_column_reader.h` unconditionally evaluates `filter_map.filter_map_data()[filter_loc]`.

So the important invariant mismatch is real: `filter_all()` implies `has_filter()`, but it does not guarantee a non-null filter-map data pointer. For nested Parquet columns, the helper expands the parent row filter into a nested value filter and can dereference null instead of producing an empty result. This is consistent with the submitted stack around `_do_lazy_read -> _read_column_data -> StructColumnReader -> ScalarColumnReader::_read_nested_column -> gen_filter_map`.

Maintainer judgment:

This should be treated as a BE crash bug in the Parquet nested-column lazy-read path, not merely as a standalone reproducer issue. The right fix direction is to handle `filter_map.filter_all()` before expanding the nested filter map, either by materializing a valid all-zero nested filter map or by otherwise skipping nested decoding safely. Checking only `has_filter()` is insufficient.
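To make the invariant mismatch and the proposed guard concrete, here is a minimal, self-contained sketch. It is not the Doris code: `FilterMapSketch`, `expand_nested_filter`, and the `values_per_row` parameter are hypothetical stand-ins, and the `init()` logic only mirrors the behavior described above (an all-filtered map may carry a null data pointer). The real reader derives nested value counts from repetition levels, which this sketch replaces with an explicit per-row count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of a row-level filter map.
// Assumption: filter_all=true sets has_filter even when data is nullptr.
class FilterMapSketch {
public:
    void init(const uint8_t* data, size_t size, bool filter_all) {
        _data = data;
        _size = size;
        _filter_all = filter_all;
        _has_filter = filter_all || data != nullptr;
    }
    bool has_filter() const { return _has_filter; }
    bool filter_all() const { return _filter_all; }
    const uint8_t* data() const { return _data; }
    size_t size() const { return _size; }

private:
    const uint8_t* _data = nullptr;
    size_t _size = 0;
    bool _has_filter = false;
    bool _filter_all = false;
};

// Expands a parent row filter into a per-value filter for a nested column.
// values_per_row[i] is the number of nested values belonging to row i.
std::vector<uint8_t> expand_nested_filter(const FilterMapSketch& parent,
                                          const std::vector<size_t>& values_per_row) {
    size_t total_values = 0;
    for (size_t n : values_per_row) {
        total_values += n;
    }

    // No filter at all: every nested value is kept.
    if (!parent.has_filter()) {
        return std::vector<uint8_t>(total_values, 1);
    }

    // The guard under discussion: check filter_all() before touching data(),
    // because data() may be nullptr here. The result is an all-zero map
    // backed by a valid buffer.
    if (parent.filter_all()) {
        return std::vector<uint8_t>(total_values, 0);
    }

    // Partial filter: data() is non-null in this model, so indexing is safe.
    std::vector<uint8_t> nested;
    nested.reserve(total_values);
    for (size_t row = 0; row < values_per_row.size(); ++row) {
        nested.insert(nested.end(), values_per_row[row], parent.data()[row]);
    }
    return nested;
}
```

In this model, calling `expand_nested_filter` on a map initialized with `init(nullptr, total_rows, true)` returns an all-zero vector instead of indexing a null pointer. Without the `filter_all()` branch, the final loop would read `parent.data()[row]` through `nullptr`, which is the crash pattern described above.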
There is already an open candidate fix, `#63889`. Its direction matches the root cause above: it avoids calling `gen_filter_map()` when the parent filter map is `filter_all()` and creates an all-zero nested filter map with a valid backing buffer.

Before merge, I would still ask for data-path coverage in addition to helper-level tests:

- A regression with a real Parquet nested column (`STRUCT`, `ARRAY`, or `MAP`) read through the external-table path.
- A predicate/lazy-read query where at least one row group or lazy-read batch is fully filtered before a nested non-predicate column is read.
- Coverage for one of Paimon/Hive/Iceberg is enough if it reaches the shared Parquet reader path; the bug is in shared Parquet code, not obviously format-catalog specific.

Missing information that would make the issue fully reproducible:

- Exact affected build or commit for the crashing BE, not just `4.x/master`.
- Minimal DDL/catalog setup and a small Parquet or table-format fixture that contains the nested column and filtered row group.
- The actual SQL, predicate, and selected columns that trigger lazy read.
- Full BE log around the crash with query id/profile if available.

Recommended next steps:

1. Review `#63889` as the likely fix for this issue.
2. Add or request a real nested-Parquet lazy-read regression before merging or backporting.
3. Consider branch-pick labels for affected 4.x branches after the regression confirms the shared Parquet path.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
