danny0405 opened a new pull request, #18987:
URL: https://github.com/apache/hudi/pull/18987

   ### Describe the issue this Pull Request addresses
   
   RFC-103 introduces an LSM tree file-group layout where base and log files 
are sorted by record key and merged with a streaming k-way merge. The reader 
side needs a dedicated implementation for that layout without changing the 
existing `HoodieFileGroupReader` path.
   
   The design also uses native parquet log files instead of Avro log files with 
embedded parquet data blocks. Native data logs use 
`<fileId>_<writeToken>_<instant>_<version>.parquet`, and native delete logs use 
`<fileId>_<writeToken>_<instant>_<version>.delete.parquet`, so common file-name 
parsing and file-system view classification need to recognize those files 
correctly.
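
   As a rough illustration of the naming scheme above, the sketch below parses the two native log name shapes with a regular expression. It is not the `FSUtils` or `HoodieLogFile` code from this PR. The class `NativeLogNameSketch`, its `parse` and `isNativeLog` helpers, and the pattern itself are hypothetical. It assumes that `fileId`, `writeToken`, and `instant` never contain an underscore, that `version` is a decimal integer, and that base files use a three-part `<fileId>_<writeToken>_<instant>.parquet` name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: not the FSUtils / HoodieLogFile implementation.
final class NativeLogNameSketch {

  // Assumption: fileId, writeToken and instant contain no '_',
  // and version is a decimal integer of at most nine digits.
  private static final Pattern NATIVE_LOG = Pattern.compile(
      "^([^_]+)_([^_]+)_([^_]+)_(\\d{1,9})(\\.delete)?\\.parquet$");

  static final class ParsedName {
    final String fileId;
    final String writeToken;
    final String instant;
    final int version;
    final boolean deleteLog;

    ParsedName(String fileId, String writeToken, String instant,
               int version, boolean deleteLog) {
      this.fileId = fileId;
      this.writeToken = writeToken;
      this.instant = instant;
      this.version = version;
      this.deleteLog = deleteLog;
    }
  }

  // Returns the parsed parts, or empty when the name is not a native log name.
  static Optional<ParsedName> parse(String fileName) {
    Matcher m = NATIVE_LOG.matcher(fileName);
    if (!m.matches()) {
      return Optional.empty();
    }
    return Optional.of(new ParsedName(
        m.group(1), m.group(2), m.group(3),
        Integer.parseInt(m.group(4)),
        m.group(5) != null)); // group 5 is the optional ".delete" marker
  }

  // A name with only three '_'-separated parts does not match, so it is
  // not classified as a native log under the assumptions above.
  static boolean isNativeLog(String fileName) {
    return parse(fileName).isPresent();
  }
}
```

Under these assumptions, a name with a fourth `_<version>` part is treated as a log file, and the `.delete` marker selects the delete-log variant. A three-part name is left for base-file handling.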
   
   ### Summary and Changelog
   
   Adds a separate LSM file-group reader for native parquet log files and 
updates common log-file parsing to recognize RFC-style native parquet 
data/delete logs.
   
   #### Commit 1: feat:(DNM) add a lsm-tree based FG reader (`f0b63593dedd`)
   - Added `HoodieLsmFileGroupReader` as a separate reader entry point instead 
of modifying `HoodieFileGroupReader`.
   - Added `LsmFileGroupRecordIterator` to perform streaming sorted k-way merge 
over one active record per base/log file.
   - Implemented the k-way merge with a loser-tree state machine, deterministic 
same-key ordering, and existing `BufferedRecordMerger` semantics.
   - Preserved existing tie behavior for equal ordering values by processing 
sources in merge order: base file first, then log files ordered by 
instant/version/write token/suffix, so later log records win when ordering 
values are equal.
   - Read native parquet data logs directly through `HoodieReaderContext` and 
added reader-side handling for native delete parquet logs with the fixed delete 
schema.
   - Added native parquet log parsing in `FSUtils` and `HoodieLogFile`, 
including data log and `.delete.parquet` delete log names.
   - Updated `AbstractTableFileSystemView` so native parquet log files are 
classified as log files and excluded from base-file discovery.
   - Added `TestHoodieLogFile` coverage for native parquet data/delete log 
parsing and helper extraction.
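
   The merge ordering described in the list above can be sketched as follows. This is a simplified illustration, not `LsmFileGroupRecordIterator`: it swaps the loser tree for a `java.util.PriorityQueue`, replaces `BufferedRecordMerger` with a plain "higher ordering value wins, ties go to the later source" rule, and does not model delete logs. The class `KWayMergeSketch` and its `Rec` type are hypothetical. Each source is assumed to be sorted by key, with index 0 as the base file and higher indices as log files already arranged in merge order.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: a heap-based stand-in for the loser-tree merge.
final class KWayMergeSketch {

  static final class Rec {
    final String key;
    final long ordering;
    final String value;

    Rec(String key, long ordering, String value) {
      this.key = key;
      this.ordering = ordering;
      this.value = value;
    }
  }

  // One active record per source, plus the rest of that source.
  private static final class Head {
    final Rec rec;
    final int source;
    final Iterator<Rec> rest;

    Head(Rec rec, int source, Iterator<Rec> rest) {
      this.rec = rec;
      this.source = source;
      this.rest = rest;
    }
  }

  static List<Rec> merge(List<List<Rec>> sources) {
    // Order by key, then by source index so same-key records are
    // visited deterministically: base first, then logs in merge order.
    PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> {
      int c = a.rec.key.compareTo(b.rec.key);
      return c != 0 ? c : Integer.compare(a.source, b.source);
    });
    for (int i = 0; i < sources.size(); i++) {
      Iterator<Rec> it = sources.get(i).iterator();
      if (it.hasNext()) {
        heap.add(new Head(it.next(), i, it));
      }
    }

    List<Rec> out = new ArrayList<>();
    Rec winner = null;
    while (!heap.isEmpty()) {
      Head h = heap.poll();
      if (winner != null && !winner.key.equals(h.rec.key)) {
        out.add(winner); // key changed: emit the merged record
        winner = null;
      }
      // ">=" lets the later source win when ordering values are equal.
      if (winner == null || h.rec.ordering >= winner.ordering) {
        winner = h.rec;
      }
      if (h.rest.hasNext()) {
        heap.add(new Head(h.rest.next(), h.source, h.rest));
      }
    }
    if (winner != null) {
      out.add(winner);
    }
    return out;
  }
}
```

Only one record per source sits in the heap at any time, so memory stays proportional to the number of files rather than the number of records.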
   
   ### Impact
   
   This adds a new reader implementation for LSM file groups without changing 
the existing `HoodieFileGroupReader` behavior. It affects common file-name 
parsing and file-system view classification for native parquet log files, 
enabling readers to distinguish native log v2 files from regular parquet base 
files.
   
   No writer path, table config default, or existing Avro log reader behavior 
is changed. The main compatibility impact is that RFC-style native parquet log 
files are now recognized as Hudi log files by common utilities.
   
   ### Risk Level
   
   medium
   
   The change touches common file parsing and file-system view classification, 
which are core read-path utilities. The new LSM reader also implements merge 
ordering semantics that must stay consistent with existing file-group merge 
behavior. Risk is mitigated by keeping the LSM reader separate from 
`HoodieFileGroupReader`, preserving existing merge APIs, and validating with:
   
   - `mvn -pl hudi-common -DskipTests compile`
   - `mvn -pl hudi-common -DskipITs -Dtest=TestHoodieLogFile test`
   
   ### Documentation Update
   
   none
   
   This PR adds a reader implementation and native log-file recognition but does 
not introduce a new user-facing config, default behavior change, or public 
documentation surface in this repo. The behavior follows the RFC-103 design.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   

