cshuo opened a new pull request, #18960:
URL: https://github.com/apache/hudi/pull/18960

   …flink2.1.x
   
   ### Describe the issue this Pull Request addresses
   
   The Flink integration's columnar Parquet reader previously relied on a 
custom, recursive set of nested-type readers (`ArrayColumnReader`, 
`MapColumnReader`, `RowColumnReader`, and the `ColumnarGroup*` data 
structures). This approach was harder to maintain and diverged from upstream 
Flink's reader implementation. Flink 2.1 introduced a Dremel-style nested 
Parquet reader that reconstructs nested structures from Parquet 
repetition/definition levels, which is more correct and aligned with the engine.
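
   To make the level-based decoding concrete, here is a minimal sketch of Dremel-style assembly for a single nullable list column. It is not the code from this PR or from Flink: the class `LevelAssemblySketch` and its `assemble` method are hypothetical. It assumes the standard three-level Parquet list layout, `optional group xs (LIST) { repeated group list { optional int32 element } }`, for which the maximum repetition level is 1 and the maximum definition level is 3. It also assumes well-formed input whose first entry has repetition level 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Hudi or Flink.
final class LevelAssemblySketch {

    // Definition levels for the assumed schema:
    //   0 = the list itself is null      1 = the list is empty
    //   2 = a null element               3 = a present element
    // Repetition levels: 0 starts a new row, 1 continues the current list.
    static List<List<Integer>> assemble(int[] rep, int[] def, int[] values) {
        if (rep.length != def.length) {
            throw new IllegalArgumentException("rep and def must have equal length");
        }
        List<List<Integer>> rows = new ArrayList<>();
        List<Integer> current = null;
        int next = 0; // index into values; only present elements are stored
        for (int i = 0; i < rep.length; i++) {
            if (rep[i] == 0) {
                // A new row begins; def 0 means the whole list is null.
                current = def[i] == 0 ? null : new ArrayList<>();
                rows.add(current);
            }
            if (def[i] == 3) {
                current.add(values[next++]);
            } else if (def[i] == 2) {
                current.add(null);
            }
            // def 0 or 1 contributes no element.
        }
        return rows;
    }

    public static void main(String[] args) {
        int[] rep = {0, 1, 0, 0, 0, 1};
        int[] def = {3, 3, 0, 1, 2, 3};
        int[] values = {1, 2, 3};
        // prints [[1, 2], null, [], [null, 3]]
        System.out.println(assemble(rep, def, values));
    }
}
```

   The point of the sketch is that no recursion over the data is needed: one flat pass over the level streams is enough to distinguish a null list, an empty list, and a null element, which are the cases a nested reader has to get right.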
   
   This PR backports that Dremel-based nested Parquet reader into the 
`hudi-flink2.1.x` module, following the earlier backport to `hudi-flink1.20` 
(`86d1650d0f1`). It replaces the legacy recursive readers with the level-based 
decoding path while keeping the split reader API stable.
   
   ### Summary and Changelog
   
   The change is a near-verbatim copy of the corresponding files from 
hudi-flink1.20.x into hudi-flink2.1.x. The only difference is that the Flink 2.1 
Parquet reader supports reading Variant-type fields, which are now also handled 
through `NestedColumnReader`.
   
   - Rewrote `ParquetSplitReaderUtil` (~750 lines reworked) to construct the 
new level-based nested reader tree.
   - Added Dremel-style readers `NestedColumnReader` and 
`NestedPrimitiveColumnReader` to decode arrays, maps, and rows from 
repetition/definition levels.
   - Introduced a `ParquetField` type hierarchy (`ParquetField`, 
`ParquetGroupField`, `ParquetPrimitiveField`) describing the nested Parquet 
schema.
   - Added position/level helpers: `NestedPositionUtil`, `CollectionPosition`, 
`LevelDelegation`, `RowPosition`, and primitive list utilities (`IntArrayList`, 
`LongArrayList`, `BooleanArrayList`).
   - Extended `ParquetDataColumnReaderFactory` and `ParquetDecimalVector`; 
updated `HeapArrayVector`, `HeapMapColumnVector`, `HeapRowColumnVector`, and 
`ParquetColumnarRowSplitReader` to the new vectors.
   - Removed legacy nested reader/data classes: `ArrayColumnReader`, 
`MapColumnReader`, `RowColumnReader`, `ArrayGroupReader`, 
`ColumnarGroupArrayData`, `ColumnarGroupMapData`, `ColumnarGroupRowData`, 
`HeapArrayGroupColumnVector`.
   - Added unit/integration tests: `TestParquetDataColumnReaderFactory`, 
`TestParquetGroupField`, `TestParquetDecimalVector`, 
`TestHeapColumnVectorAccessors`, plus cases in `ITTestHoodieDataSource` and 
`TestSQL`.
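
   As background for the `ParquetField` hierarchy listed above, the sketch below shows how maximum repetition and definition levels follow from a nested Parquet schema. Per the Parquet format, every `optional` or `repeated` field on the path to a leaf adds one to the definition level, and every `repeated` field adds one to the repetition level. The types here (`SchemaLevelsSketch`, `Node`, `Repetition`, `exampleSchema`) are hypothetical stand-ins written for illustration, not the classes added in this PR.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the ParquetField classes from this PR.
final class SchemaLevelsSketch {

    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    // A schema node; a node without children is a primitive leaf.
    static final class Node {
        final String name;
        final Repetition repetition;
        final List<Node> children;

        Node(String name, Repetition repetition, Node... children) {
            this.name = name;
            this.repetition = repetition;
            this.children = Arrays.asList(children);
        }
    }

    // Maps each leaf path to {maxRepetitionLevel, maxDefinitionLevel}.
    // The root stands for the message itself and contributes no levels.
    static Map<String, int[]> leafLevels(Node root) {
        Map<String, int[]> out = new LinkedHashMap<>();
        for (Node child : root.children) {
            walk(child, "", 0, 0, out);
        }
        return out;
    }

    private static void walk(Node n, String prefix, int rep, int def, Map<String, int[]> out) {
        if (n.repetition == Repetition.REPEATED) {
            rep++;
            def++;
        } else if (n.repetition == Repetition.OPTIONAL) {
            def++;
        }
        String path = prefix.isEmpty() ? n.name : prefix + "." + n.name;
        if (n.children.isEmpty()) {
            out.put(path, new int[] {rep, def});
            return;
        }
        for (Node c : n.children) {
            walk(c, path, rep, def, out);
        }
    }

    // message { required int64 id; optional group xs (LIST) { repeated group list { optional int32 element } } }
    static Node exampleSchema() {
        return new Node("root", Repetition.REQUIRED,
            new Node("id", Repetition.REQUIRED),
            new Node("xs", Repetition.OPTIONAL,
                new Node("list", Repetition.REPEATED,
                    new Node("element", Repetition.OPTIONAL))));
    }

    public static void main(String[] args) {
        for (Map.Entry<String, int[]> e : leafLevels(exampleSchema()).entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

   For the example schema, `id` gets levels `{0, 0}` and `xs.list.element` gets `{1, 3}`, which are the maxima a level-based reader compares against when deciding whether a value, a null, or an empty collection was written.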
   
   ### Impact
   
   - **Functional impact**: No intended behavior change for users; nested-type 
(array/map/row) reads from Parquet files are now decoded via the Dremel 
level-based path. Net +3345/−1427 lines across 33 files.
   - **Maintainability**: Aligns the `hudi-flink2.1.x` reader with upstream 
Flink 2.1 and the prior `hudi-flink1.20` backport, removing bespoke recursive 
readers in favor of a shared, level-based design.
   - **Extensibility**: The `ParquetField` type hierarchy and position 
utilities provide a clearer foundation for supporting additional/deeply nested 
Parquet structures.
   
   ### Risk Level
   
   Medium. The change touches the core Parquet read path and replaces the 
entire nested-type reader implementation, so it has a meaningful blast radius 
for Flink reads of nested data. Risk is mitigated by added unit tests for the 
new readers/types/vectors, by integration coverage in 
`ITTestHoodieDataSource`/`TestSQL`, and by mirroring the already-merged 
`hudi-flink1.20` backport.
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable

