cshuo opened a new pull request, #18960: URL: https://github.com/apache/hudi/pull/18960
…flink2.1.x

### Describe the issue this Pull Request addresses

The Flink integration's columnar Parquet reader previously relied on a custom, recursive set of nested-type readers (`ArrayColumnReader`, `MapColumnReader`, `RowColumnReader`, and the `ColumnarGroup*` data structures). This approach was hard to maintain and diverged from upstream Flink's reader implementation. Flink 2.1 introduced a Dremel-style nested Parquet reader that reconstructs nested structures from Parquet repetition/definition levels, which is more correct and aligned with the engine.

This PR backports that Dremel-based nested Parquet reader into the `hudi-flink2.1.x` module, following the earlier backport to `hudi-flink1.20` (`86d1650d0f1`). It replaces the legacy recursive readers with the level-based decoding path while keeping the split reader API stable.

### Summary and Changelog

The change is a near-verbatim copy of the corresponding files from hudi-flink1.20.x into hudi-flink2.1.x. The only difference is that the Flink 2.1 Parquet reader supports reading Variant-type fields, which is now also supported through `NestedColumnReader`.

- Rewrote `ParquetSplitReaderUtil` (~750 lines reworked) to construct the new level-based nested reader tree.
- Added Dremel-style readers `NestedColumnReader` and `NestedPrimitiveColumnReader` to decode arrays, maps, and rows from repetition/definition levels.
- Introduced a `ParquetField` type hierarchy (`ParquetField`, `ParquetGroupField`, `ParquetPrimitiveField`) describing the nested Parquet schema.
- Added position/level helpers: `NestedPositionUtil`, `CollectionPosition`, `LevelDelegation`, `RowPosition`, and primitive list utilities (`IntArrayList`, `LongArrayList`, `BooleanArrayList`).
- Extended `ParquetDataColumnReaderFactory` and `ParquetDecimalVector`; updated `HeapArrayVector`, `HeapMapColumnVector`, `HeapRowColumnVector`, and `ParquetColumnarRowSplitReader` to work with the new vectors.
- Removed legacy nested reader/data classes: `ArrayColumnReader`, `MapColumnReader`, `RowColumnReader`, `ArrayGroupReader`, `ColumnarGroupArrayData`, `ColumnarGroupMapData`, `ColumnarGroupRowData`, `HeapArrayGroupColumnVector`.
- Added unit/integration tests: `TestParquetDataColumnReaderFactory`, `TestParquetGroupField`, `TestParquetDecimalVector`, `TestHeapColumnVectorAccessors`, plus cases in `ITTestHoodieDataSource` and `TestSQL`.

### Impact

- **Functional impact**: No intended behavior change for users; nested-type (array/map/row) reads from Parquet files are now decoded via the Dremel level-based path. Net +3345/−1427 lines across 33 files.
- **Maintainability**: Aligns the `hudi-flink2.1.x` reader with upstream Flink 2.1 and the prior `hudi-flink1.20` backport, removing bespoke recursive readers in favor of a shared, level-based design.
- **Extensibility**: The `ParquetField` type hierarchy and position utilities provide a clearer foundation for supporting additional or deeply nested Parquet structures.

### Risk Level

Medium. The change touches the core Parquet read path and replaces the entire nested-type reader implementation, so it has a meaningful blast radius for Flink reads of nested data. The risk is mitigated by the added unit tests for the new readers/types/vectors, by integration coverage in `ITTestHoodieDataSource`/`TestSQL`, and by mirroring the already-merged `hudi-flink1.20` backport.

### Documentation Update

none

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
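
For reviewers less familiar with the level-based decoding described above, here is a minimal, self-contained sketch of how repetition/definition levels can be turned back into rows for a single nullable list column. It is not code from this PR or from Flink: the class `DremelListSketch`, the method `assemble`, and the level constants are hypothetical names, and the sketch assumes the schema `optional group xs (LIST) { repeated group list { required int32 element; } }`, so the maximum definition level is 2 and the maximum repetition level is 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the NestedColumnReader from this PR.
final class DremelListSketch {
    // Definition levels for the assumed schema (optional list, required element).
    static final int DEF_NULL_LIST = 0;   // the list itself is null
    static final int DEF_EMPTY_LIST = 1;  // the list exists but has no elements
    static final int DEF_ELEMENT = 2;     // an element value is present

    /**
     * Rebuilds one list per row from parallel level arrays.
     * repLevels[i] == 0 starts a new row; repLevels[i] == 1 continues the current list.
     * values holds only the entries whose definition level is DEF_ELEMENT.
     */
    public static List<List<Integer>> assemble(int[] repLevels, int[] defLevels, int[] values) {
        if (repLevels.length != defLevels.length) {
            throw new IllegalArgumentException("level arrays must have the same length");
        }
        List<List<Integer>> rows = new ArrayList<>();
        List<Integer> current = null;
        int valueIdx = 0;
        for (int i = 0; i < defLevels.length; i++) {
            if (repLevels[i] == 0) {
                // A repetition level of 0 marks a row boundary.
                current = defLevels[i] == DEF_NULL_LIST ? null : new ArrayList<>();
                rows.add(current);
            }
            if (defLevels[i] == DEF_ELEMENT) {
                if (current == null) {
                    throw new IllegalStateException("element without an open list at index " + i);
                }
                current.add(values[valueIdx++]);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Encodes the four rows [1, 2], null, [], [3].
        int[] rep = {0, 1, 0, 0, 0};
        int[] def = {2, 2, 0, 1, 2};
        int[] values = {1, 2, 3};
        System.out.println(assemble(rep, def, values)); // prints [[1, 2], null, [], [3]]
    }
}
```

The point of the sketch is that a single flat pass over the levels distinguishes a null list, an empty list, and a populated list without any recursion over the schema, which is the property the level-based path relies on. A real reader additionally handles nullable elements, deeper nesting, maps, and rows, and writes into column vectors rather than Java collections.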
