nsivabalan opened a new issue, #18825: URL: https://github.com/apache/hudi/issues/18825
## Tips before filing an issue

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? Yes
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

## Describe the problem you faced

During Record Level Index (RLI) bootstrap, `HoodieBackedTableMetadataWriter#initializeRecordIndexPartition` reads record keys from the data table base files (and merged log files, for MOR) into a `HoodieData<HoodieRecord>` and then calls `estimateFileGroupCount(records)`. The estimation supplier persists the RDD (`records.persist("MEMORY_AND_DISK_SER")`) and then iterates over the entire dataset (`records.count()`) just to obtain the total record count used to size the RLI file groups.

On large tables this is the primary latency and memory bottleneck of RLI bootstrap:

- The fully materialized RDD of `(record key, location)` pairs is forced into the cluster's storage tier before any RLI commit work can start.
- Even though we already know the on-disk row counts (Parquet/ORC/HFile/Lance footers carry this), we pay for an extra distributed scan over every row.
- When the user has pinned the RLI file group count (`min == max`), the count is not even needed, yet the estimation runs anyway.

## To Reproduce

Steps to reproduce the behavior:

1. Enable RLI on a sufficiently large existing table (e.g. via `hoodie.metadata.record.index.enable=true` or `hoodie.metadata.record.level.index.enable=true`).
2. Trigger metadata table initialization (any write to the data table will do so).
3. Observe in the driver logs that RLI initialization spends a significant fraction of time on a Spark stage that persists and counts the RLI records RDD before any file groups are written.

## Expected behavior

RLI bootstrap should estimate the total record count from already-available footer metadata of base files rather than materializing and counting an RDD of record entries. When `record.index.min.filegroup.count == record.index.max.filegroup.count` the estimation should be skipped entirely and the configured value used directly.

## Environment Description

- Hudi version : master (post 1.x)
- Spark version : any
- Hive version : N/A
- Hadoop version : N/A
- Storage (HDFS/S3/GCS..) : any
- Running on Docker? (yes/no) : no

## Additional context

This issue is filed to track the optimization of the RLI bootstrap path. A patch is forthcoming.

## Stacktrace

N/A (performance issue, not a crash).
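
To make the expected behavior concrete, here is a minimal sketch of the sizing logic in plain Java. It is not the forthcoming patch and does not use Hudi's APIs: the class `RliFileGroupSizing`, the method `fileGroupCount`, and its parameters (`avgRecordSizeBytes`, `maxFileGroupSizeBytes`) are hypothetical names chosen for illustration. The sketch assumes the per-file row counts have already been read from base-file footers, and it ignores any records that exist only in log files.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of footer-based RLI sizing; not Hudi's actual implementation. */
final class RliFileGroupSizing {

  /**
   * Returns the number of RLI file groups to create.
   * The footerRowCounts supplier is only evaluated when minCount != maxCount,
   * so a pinned file group count never triggers any counting work.
   */
  static int fileGroupCount(int minCount,
                            int maxCount,
                            Supplier<List<Long>> footerRowCounts,
                            long avgRecordSizeBytes,
                            long maxFileGroupSizeBytes) {
    if (minCount > maxCount) {
      throw new IllegalArgumentException("minCount must not exceed maxCount");
    }
    if (maxFileGroupSizeBytes <= 0) {
      throw new IllegalArgumentException("maxFileGroupSizeBytes must be positive");
    }
    // Pinned count: skip the estimation entirely.
    if (minCount == maxCount) {
      return minCount;
    }
    // Sum the row counts taken from base-file footers (assumed to be given).
    long totalRecords = 0L;
    for (long rows : footerRowCounts.get()) {
      totalRecords = Math.addExact(totalRecords, rows);
    }
    long estimatedBytes = Math.multiplyExact(totalRecords, avgRecordSizeBytes);
    // Ceiling division: file groups needed to hold the estimated index size.
    long needed = (estimatedBytes + maxFileGroupSizeBytes - 1) / maxFileGroupSizeBytes;
    // Clamp the result to the configured [minCount, maxCount] range.
    return (int) Math.max(minCount, Math.min(maxCount, needed));
  }

  public static void main(String[] args) {
    // Made-up row counts standing in for values read from three base-file footers.
    List<Long> rowCounts = List.of(1_000_000L, 2_500_000L, 500_000L);
    int count = fileGroupCount(1, 100, () -> rowCounts, 48L, 64L * 1024 * 1024);
    System.out.println("Estimated RLI file groups: " + count);
  }
}
```

Passing the row counts as a `Supplier` keeps the footer reads lazy, which is what allows the `min == max` case to return without touching any files.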
