nsivabalan opened a new pull request, #18826: URL: https://github.com/apache/hudi/pull/18826
### Change Logs

Fixes https://github.com/apache/hudi/issues/18825.

During RLI bootstrap, `HoodieBackedTableMetadataWriter#initializeRecordIndexPartition` previously persisted the full materialized RDD of RLI records (`records.persist("MEMORY_AND_DISK_SER")`) and counted it (`records.count()`) purely to obtain the total record count used to size the RLI file groups. This is the primary latency and memory bottleneck of RLI bootstrap on large tables.

This PR:

- Replaces the persist+count of the RLI records with a direct row-count read from each base file's footer metadata via `FileFormatUtils.getRowCount(...)`. Footer reads are O(1) per file and avoid materializing the record dataset.
- Reuses the file slices already collected for record-key reading; for MOR, base files are extracted from the file slices already in hand (estimation uses base file row counts only; log file deltas are bounded and not material for sizing).
- Bypasses estimation entirely when the user pins the RLI file group count via `min == max`, and uses the configured value directly.
- Adds `TestHoodieBackedMetadata#testRecordIndexFileGroupEstimation` and `testRecordIndexWithFixedFileGroupCount` covering both COW and MOR.

### Impact

Decouples RLI file group sizing from materializing the record-keys RDD. Eliminates the persist+count pass during RLI bootstrap, which is the primary latency bottleneck on large tables.

### Risk level

low

### Documentation Update

No new configs or user-facing behavior changes.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
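
For readers who want a concrete picture of the sizing approach described under Change Logs, here is a minimal sketch, not the code in this PR. It assumes the per-file row counts have already been read from base file footers (in Hudi that would be via `FileFormatUtils.getRowCount(...)`; here they are passed in as plain numbers). The class name `RliFileGroupSizingSketch`, the parameters `avgRecordSizeBytes` and `maxFileGroupSizeBytes`, and the ceiling-then-clamp formula are all hypothetical, chosen for illustration only; Hudi's actual estimation logic may differ.

```java
import java.util.List;

public class RliFileGroupSizingSketch {

    /** Sums per-file row counts; the inputs stand in for values read from base file footers. */
    static long totalRowCount(List<Long> footerRowCounts) {
        long total = 0L;
        for (long count : footerRowCounts) {
            total += count;
        }
        return total;
    }

    /**
     * Hypothetical estimate of the RLI file group count.
     * If min == max the count is pinned and no estimation is done;
     * otherwise the estimated index size is divided by the target file group
     * size (rounded up) and clamped to [minFileGroups, maxFileGroups].
     */
    static int estimateFileGroupCount(List<Long> footerRowCounts,
                                      long avgRecordSizeBytes,
                                      long maxFileGroupSizeBytes,
                                      int minFileGroups,
                                      int maxFileGroups) {
        if (minFileGroups == maxFileGroups) {
            // Pinned by configuration: skip the row-count pass entirely.
            return minFileGroups;
        }
        long totalBytes = totalRowCount(footerRowCounts) * avgRecordSizeBytes;
        // Ceiling division without floating point.
        long needed = (totalBytes + maxFileGroupSizeBytes - 1) / maxFileGroupSizeBytes;
        return (int) Math.max(minFileGroups, Math.min(maxFileGroups, needed));
    }

    public static void main(String[] args) {
        // Three base files with assumed row counts (4,000,000 rows in total).
        List<Long> rowCounts = List.of(1_000_000L, 2_500_000L, 500_000L);
        long sixtyFourMiB = 64L * 1024 * 1024;

        // Estimated: 4,000,000 rows * 48 bytes = 192,000,000 bytes -> 3 file groups.
        System.out.println(estimateFileGroupCount(rowCounts, 48L, sixtyFourMiB, 1, 10));
        // Pinned: min == max, so the configured value 8 is returned directly.
        System.out.println(estimateFileGroupCount(rowCounts, 48L, sixtyFourMiB, 8, 8));
    }
}
```

The sketch never touches the records themselves: the only inputs are one number per base file, which is what makes the footer-based approach cheap compared with persisting and counting an RDD.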
