Re: [I] [Bug](parquet) SIGSEGV in gen_filter_map when reading nested columns with filter_all=true [doris]

via GitHub Thu, 28 May 2026 21:03:11 -0700


zclllyybb commented on issue #63887:
URL: https://github.com/apache/doris/issues/63887#issuecomment-4570416436


   Initial maintainer triage for `apache/doris#63887`.
   
   I checked the live issue metadata and current `upstream/master` 
(`6e27f117471f481e13cebabe0454dddc60e5245c`). The issue currently has no 
labels, and the title has already been updated to the concrete Parquet crash 
path. The report is credible and actionable.
   
   Evidence from the current code:
   
   - `FilterMap::init()` in `be/src/format/parquet/parquet_common.cpp` sets 
`_has_filter=true` when `filter_all=true`. If the caller passes `nullptr` for 
the filter data, `filter_map_data()` remains `nullptr`.
   - `RowGroupReader::_rebuild_filter_map()` in 
`be/src/format/parquet/vparquet_group_reader.cpp` has a direct 
`filter_map.init(nullptr, total_rows, true)` path when rebuilding an 
all-filtered lazy-read filter map.
   - `ScalarColumnReader::_read_nested_column()` in 
`be/src/format/parquet/vparquet_column_reader.cpp` calls `gen_filter_map()` 
whenever `filter_map.has_filter()` is true.
   - `ScalarColumnReader::gen_filter_map()` in 
`be/src/format/parquet/vparquet_column_reader.h` unconditionally evaluates 
`filter_map.filter_map_data()[filter_loc]`.
   
   So the important invariant mismatch is real: `filter_all()` implies 
`has_filter()`, but it does not guarantee a non-null filter-map data pointer. 
For nested Parquet columns, the helper expands the parent row filter into a 
nested value filter and can dereference null instead of producing an empty 
result. This is consistent with the submitted stack around `_do_lazy_read -> 
_read_column_data -> StructColumnReader -> 
ScalarColumnReader::_read_nested_column -> gen_filter_map`.
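
   The mismatch can be illustrated with a minimal sketch. `FilterMapModel`, 
`safe_to_index()`, and `checked_at()` are hypothetical names made up for this 
illustration; the class is a simplified stand-in that only mirrors the behavior 
described above, not the real `FilterMap` in `parquet_common.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for FilterMap (assumption, not Doris code).
class FilterMapModel {
public:
    // Mirrors the described behavior: filter_all=true turns on has_filter
    // even when no backing buffer is supplied.
    void init(const uint8_t* data, size_t size, bool filter_all) {
        _data = data;
        _size = size;
        _filter_all = filter_all;
        _has_filter = filter_all;
        if (!filter_all && data != nullptr) {
            for (size_t i = 0; i < size; ++i) {
                if (data[i] == 0) {
                    _has_filter = true;
                    break;
                }
            }
        }
    }
    bool has_filter() const { return _has_filter; }
    bool filter_all() const { return _filter_all; }
    const uint8_t* filter_map_data() const { return _data; }
    size_t size() const { return _size; }

private:
    const uint8_t* _data = nullptr;
    size_t _size = 0;
    bool _has_filter = false;
    bool _filter_all = false;
};

// has_filter() alone does not prove that indexing the data is safe.
bool safe_to_index(const FilterMapModel& m) {
    return m.has_filter() && m.filter_map_data() != nullptr;
}

// Defensive accessor: fails an assertion instead of dereferencing null.
uint8_t checked_at(const FilterMapModel& m, size_t i) {
    assert(safe_to_index(m) && i < m.size());
    return m.filter_map_data()[i];
}
```

   In this model, `init(nullptr, n, true)` yields `has_filter() == true` with a 
null data pointer, which is the state the nested reader would then index into.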
   
   Maintainer judgment:
   
   This should be treated as a BE crash bug in the Parquet nested-column 
lazy-read path, rather than only as an issue that needs a standalone 
reproducer. The fix should handle `filter_map.filter_all()` before expanding 
the nested filter map, either by materializing a valid all-zero nested filter 
map or by otherwise skipping nested decoding safely. Checking only 
`has_filter()` is insufficient.
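
   A minimal sketch of that fix direction follows. `ParentFilter` and 
`expand_nested_filter()` are hypothetical names for illustration only, not the 
signatures used in `vparquet_column_reader.h`. The sketch also assumes the batch 
starts on a row boundary and that a repetition level of 0 marks the first value 
of a new parent row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified view of the parent row filter (assumption).
struct ParentFilter {
    const uint8_t* data = nullptr;  // may be null when filter_all is true
    size_t num_rows = 0;
    bool filter_all = false;
};

// Expands a row-level filter into a value-level filter for a nested column.
std::vector<uint8_t> expand_nested_filter(const ParentFilter& parent,
                                          const std::vector<int16_t>& rep_levels) {
    // Handle filter_all first: the parent data pointer is never touched,
    // and the result is an all-zero map with a valid backing buffer.
    if (parent.filter_all) {
        return std::vector<uint8_t>(rep_levels.size(), 0);
    }
    assert(parent.data != nullptr);

    std::vector<uint8_t> nested;
    nested.reserve(rep_levels.size());
    size_t row = 0;
    bool seen_first = false;
    for (int16_t rep : rep_levels) {
        if (rep == 0) {
            // Advance to the next parent row, except for the very first value.
            if (seen_first) {
                ++row;
            }
            seen_first = true;
        }
        assert(row < parent.num_rows);
        nested.push_back(parent.data[row]);
    }
    return nested;
}
```

   Returning an owned buffer keeps the all-filtered case on the same code path 
as the normal case for downstream consumers, at the cost of one allocation per 
batch; skipping nested decoding entirely would avoid that allocation.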
   
   There is already an open candidate fix, `#63889`. Its direction matches the 
root cause above: it avoids calling `gen_filter_map()` when the parent filter 
map is `filter_all()` and creates an all-zero nested filter map with a valid 
backing buffer. Before merge, I would still ask for data-path coverage in 
addition to helper-level tests:
   
   - A regression with a real Parquet nested column (`STRUCT`, `ARRAY`, or 
`MAP`) read through the external-table path.
   - A predicate/lazy-read query where at least one row group or lazy-read 
batch is fully filtered before a nested non-predicate column is read.
   - Coverage for one of Paimon/Hive/Iceberg is enough if it reaches the shared 
Parquet reader path; the bug is in shared Parquet code and does not appear to 
be specific to any table format or catalog.
   
   Missing information that would make the issue fully reproducible:
   
   - Exact affected build or commit for the crashing BE, not just `4.x/master`.
   - Minimal DDL/catalog setup and a small Parquet or table-format fixture that 
contains the nested column and filtered row group.
   - The actual SQL, predicate, and selected columns that trigger lazy read.
   - Full BE log around the crash with query id/profile if available.
   
   Recommended next steps:
   
   1. Review `#63889` as the likely fix for this issue.
   2. Add or request a real nested-Parquet lazy-read regression before merging 
or backporting.
   3. Consider branch-pick labels for affected 4.x branches after the 
regression confirms the shared Parquet path.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

