clee704 opened a new pull request, #3458: URL: https://github.com/apache/parquet-java/pull/3458
Fixes #3457

### What changes were proposed in this pull request?

Add validation in `ColumnIndexBuilder.build(PrimitiveType)` to detect a contradictory column index where `null_pages[i]` is `true` (page consists entirely of null values) but `null_counts[i]` is `0` (page has zero null values). When detected, `build()` returns `null`, following the existing pattern for invalid min/max values.

Also fix a pre-existing NPE in the static `build()` method where `build(type)` could return `null` but was not null-checked before accessing `boundaryOrder`.

### Why are the changes needed?

Column index filtering silently excludes pages with this contradiction from query results for all predicates:

- **Non-null predicates** (e.g., `WHERE col = 50`): `BoundaryOrder` comparators iterate over `pageIndexes`, which omits pages where `null_pages[i]` is `true`. These pages are never evaluated and their rows are excluded.
- **Null predicates** (e.g., `WHERE col IS NULL`): `ColumnIndexBase.visit(Eq)` checks `nullCounts[pageIndex] > 0`, which returns `false` when `null_counts` is `0`. The page is excluded.

The result is silent data loss with no error or warning. Only unfiltered reads return correct results.

`ColumnIndexBuilder.build(PrimitiveType)` already returns `null` for other kinds of invalid metadata (empty pages, invalid min/max values). This adds one more validation of the same kind.

### How was this patch tested?

3 new test methods (15 assertions) in `TestColumnIndexBuilder`:

- `testBuildReturnsNullForNullPageCountContradiction`: 5 rejection cases (contradiction at first/middle/last page, single page, all pages)
- `testBuildPreservesValidColumnIndex`: 6 preservation cases (legitimate null pages, all non-null pages, single pages, boundary null_counts=1)
- `testBuildWithoutNullCountsIsNotRejected`: null_counts absent (optional field) is not rejected
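
For readers who want to see the shape of the check, here is a minimal standalone sketch of the contradiction described above. It is not the code from this PR and does not use any parquet-java classes: the class name `NullPageCheckSketch`, the method `hasNullPageCountContradiction`, and the plain-array representation of `null_pages` and `null_counts` are all assumptions made for illustration. A missing or empty `nullCounts` array stands in for the optional field being absent.

```java
// Hypothetical illustration only; not the parquet-java implementation.
final class NullPageCheckSketch {

    /**
     * Returns true if any page is marked as all-null (nullPages[i] == true)
     * while its null count is reported as 0.
     * A null or empty nullCounts array models an absent optional field,
     * in which case there is nothing to contradict.
     */
    static boolean hasNullPageCountContradiction(boolean[] nullPages, long[] nullCounts) {
        if (nullCounts == null || nullCounts.length == 0) {
            return false;
        }
        // Only compare positions present in both arrays.
        int pages = Math.min(nullPages.length, nullCounts.length);
        for (int i = 0; i < pages; i++) {
            if (nullPages[i] && nullCounts[i] == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Middle page claims to be all-null but reports zero nulls.
        boolean[] nullPages = {false, true, false};
        long[] nullCounts = {0, 0, 3};
        System.out.println(hasNullPageCountContradiction(nullPages, nullCounts));
    }
}
```

In this sketch a caller would treat a `true` result the same way the description treats other invalid metadata: discard the column index rather than filter with it.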
