clee704 opened a new pull request, #3458:
URL: https://github.com/apache/parquet-java/pull/3458

   Fixes #3457
   
   ### What changes were proposed in this pull request?
   
   Add validation in `ColumnIndexBuilder.build(PrimitiveType)` to detect a 
contradictory column index where `null_pages[i]` is `true` (page consists 
entirely of null values) but `null_counts[i]` is `0` (page has zero null 
values). When detected, `build()` returns `null`, following the existing 
pattern for invalid min/max values.
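
As a rough illustration of the check described above, here is a minimal sketch. It is not the code in this patch: the class `NullPageContradictionCheck` and the method `hasContradiction` are hypothetical names, and plain lists stand in for the builder's internal state.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of parquet-java.
final class NullPageContradictionCheck {
    private NullPageContradictionCheck() {}

    /**
     * Returns true if any page is flagged as all-null (nullPages[i] == true)
     * while also reporting zero nulls (nullCounts[i] == 0).
     * A null or empty nullCounts list is treated as "absent", because
     * null_counts is an optional field, and is never reported as contradictory.
     */
    static boolean hasContradiction(List<Boolean> nullPages, List<Long> nullCounts) {
        if (nullCounts == null || nullCounts.isEmpty()) {
            return false;
        }
        int pages = Math.min(nullPages.size(), nullCounts.size());
        for (int i = 0; i < pages; i++) {
            if (nullPages.get(i) && nullCounts.get(i) == 0L) {
                return true;
            }
        }
        return false;
    }
}
```

Under this sketch, a caller in the position of `build()` would return `null` whenever `hasContradiction(...)` is `true`.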
   
   Also fix a pre-existing NPE in the static `build()` method where 
`build(type)` could return `null` but was not null-checked before accessing 
`boundaryOrder`.
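
The shape of the null guard can be sketched as follows. Every type here is a hypothetical stand-in rather than the real parquet-java classes, and the assumption that the static method assigns `boundaryOrder` on the returned object is mine, not something the description states.

```java
// Hypothetical stand-ins for illustration only; not the parquet-java classes.
final class NullSafeBuildSketch {
    enum BoundaryOrder { UNORDERED, ASCENDING, DESCENDING }

    static final class ColumnIndexSketch {
        BoundaryOrder boundaryOrder;
    }

    private NullSafeBuildSketch() {}

    // Stand-in for build(type): yields null when the metadata is invalid.
    static ColumnIndexSketch buildOrNull(boolean metadataValid) {
        return metadataValid ? new ColumnIndexSketch() : null;
    }

    // Stand-in for the static build(): guard before touching boundaryOrder.
    static ColumnIndexSketch build(boolean metadataValid, BoundaryOrder order) {
        ColumnIndexSketch index = buildOrNull(metadataValid);
        if (index == null) {
            return null; // propagate "no usable column index" instead of throwing an NPE
        }
        index.boundaryOrder = order;
        return index;
    }
}
```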
   
   ### Why are the changes needed?
   
   Column index filtering silently excludes pages with this contradiction from 
query results for all predicates:
   - **Non-null predicates** (e.g., `WHERE col = 50`): `BoundaryOrder` 
comparators iterate over `pageIndexes`, which omits pages where `null_pages[i]` 
is `true`. These pages are never evaluated and their rows are excluded.
   - **Null predicates** (e.g., `WHERE col IS NULL`): 
`ColumnIndexBase.visit(Eq)` checks `nullCounts[pageIndex] > 0`, which evaluates 
to `false` when `null_counts[i]` is `0`. The page is excluded.
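
The two cases above can be reproduced with a simplified model. This is a sketch only: `ContradictoryPageFilterSketch`, `pagesForEq`, and `pagesForIsNull` are hypothetical names, and the min/max range test is a stand-in for the real `BoundaryOrder` comparators.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of page selection; not the parquet-java code.
final class ContradictoryPageFilterSketch {
    private ContradictoryPageFilterSketch() {}

    // Non-null predicate (col = value): pages flagged in nullPages are never evaluated.
    static List<Integer> pagesForEq(boolean[] nullPages, long[] mins, long[] maxs, long value) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < nullPages.length; i++) {
            if (nullPages[i]) {
                continue; // mirrors the page being omitted from pageIndexes
            }
            if (mins[i] <= value && value <= maxs[i]) {
                selected.add(i);
            }
        }
        return selected;
    }

    // Null predicate (col IS NULL): a page is kept only if it reports at least one null.
    static List<Integer> pagesForIsNull(long[] nullCounts) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < nullCounts.length; i++) {
            if (nullCounts[i] > 0) {
                selected.add(i);
            }
        }
        return selected;
    }
}
```

With `nullPages = {false, true, false}` and `nullCounts = {0, 0, 0}`, page 1 appears in neither result list, whatever it actually contains.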
   
   The result is silent data loss with no error or warning. Only unfiltered 
reads return correct results.
   
   `ColumnIndexBuilder.build(PrimitiveType)` already returns `null` for other 
kinds of invalid metadata (empty pages, invalid min/max values). This adds one 
more validation of the same kind.
   
   ### How was this patch tested?
   
   3 new test methods (15 assertions) in `TestColumnIndexBuilder`:
   - `testBuildReturnsNullForNullPageCountContradiction`: 5 rejection cases — 
contradiction at first/middle/last page, single page, all pages
   - `testBuildPreservesValidColumnIndex`: 6 preservation cases — legitimate 
null pages, all non-null pages, single pages, boundary null_counts=1
   - `testBuildWithoutNullCountsIsNotRejected`: null_counts absent (optional 
field) is not rejected


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

