nsivabalan opened a new issue, #18825:
URL: https://github.com/apache/hudi/issues/18825

   ## Tips before filing an issue
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? Yes
   
   - Join the mailing list to engage in conversations and get faster support at 
[email protected].
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   ## Describe the problem you faced
   
   During Record Level Index (RLI) bootstrap, 
`HoodieBackedTableMetadataWriter#initializeRecordIndexPartition` reads record 
keys from the data table base files (and merged log files, for MOR) into an 
`HoodieData<HoodieRecord>` and then calls `estimateFileGroupCount(records)`. 
The estimation supplier persists the RDD 
(`records.persist("MEMORY_AND_DISK_SER")`) and then iterates over the entire 
dataset (`records.count()`) just to obtain the total record count used to size 
the RLI file groups.
   
   On large tables this is the primary latency and memory bottleneck of RLI 
bootstrap:
   - The full materialized RDD of `(record key, location)` pairs is forced into 
the cluster's storage tier before any RLI commit work can start.
   - Even though we already know the on-disk row counts 
(Parquet/ORC/HFile/Lance footers carry this), we pay an extra distributed scan 
over every row.
   - When the user has pinned the RLI file group count (`min == max`), the 
count is not needed at all, yet the estimation runs anyway.
   
   ## To Reproduce
   
   Steps to reproduce the behavior:
   
   1. Enable RLI on a sufficiently large existing table (e.g. via 
`hoodie.metadata.record.index.enable=true` or 
`hoodie.metadata.record.level.index.enable=true`).
   2. Trigger metadata table initialization (any write to the data table will 
do so).
   3. Observe in the driver logs that RLI initialization spends a significant 
fraction of time on a Spark stage that persists and counts the RLI records RDD 
before any file groups are written.
   
   ## Expected behavior
   
   RLI bootstrap should estimate the total record count from already-available 
footer metadata of base files rather than materializing and counting an RDD of 
record entries. When `record.index.min.filegroup.count == 
record.index.max.filegroup.count` the estimation should be skipped entirely and 
the configured value used directly.
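
   A minimal sketch of the expected sizing logic, for illustration only. It is 
not the forthcoming patch and uses no Hudi APIs: the class name, the method, 
and the `targetRecordsPerFileGroup` knob are all hypothetical, and the per-file 
row counts are assumed to have already been read from base file footers. Log 
file records on MOR tables are not covered by base file footers, so the total 
is an estimate.

```java
import java.util.List;

/** Hypothetical sketch; none of these names are Hudi APIs. */
final class RliFileGroupSizing {

  private RliFileGroupSizing() {}

  /**
   * Estimates the RLI file group count without scanning any rows.
   *
   * @param footerRowCounts row count reported by each base file footer (assumed already read)
   * @param minFileGroups configured lower bound on the file group count
   * @param maxFileGroups configured upper bound on the file group count
   * @param targetRecordsPerFileGroup hypothetical sizing knob: desired records per file group
   */
  static int estimateFileGroupCount(List<Long> footerRowCounts,
                                    int minFileGroups,
                                    int maxFileGroups,
                                    long targetRecordsPerFileGroup) {
    if (minFileGroups <= 0 || maxFileGroups < minFileGroups || targetRecordsPerFileGroup <= 0) {
      throw new IllegalArgumentException("invalid file group bounds or target size");
    }

    // Pinned count (min == max): the record count is irrelevant, so skip estimation.
    if (minFileGroups == maxFileGroups) {
      return minFileGroups;
    }

    // Sum the footer row counts instead of persisting and counting an RDD of records.
    long totalRecords = 0L;
    for (long rowCount : footerRowCounts) {
      totalRecords = Math.addExact(totalRecords, rowCount);
    }

    // Ceiling division, written to avoid overflow for very large totals.
    long needed = totalRecords / targetRecordsPerFileGroup
        + (totalRecords % targetRecordsPerFileGroup == 0 ? 0 : 1);

    // Clamp the estimate into the configured [min, max] range.
    return (int) Math.max(minFileGroups, Math.min(maxFileGroups, needed));
  }
}
```

   With footer counts of 1000, 2500, and 500 rows, bounds of 1 and 10, and a 
target of 1000 records per file group, the sketch sums to 4000 records and 
returns 4. With bounds pinned at 8 and 8 it returns 8 without reading the 
counts.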
   
   ## Environment Description
   
   - Hudi version : master (post 1.x)
   - Spark version : any
   - Hive version : N/A
   - Hadoop version : N/A
   - Storage (HDFS/S3/GCS..) : any
   - Running on Docker? (yes/no) : no
   
   ## Additional context
   
   This issue is filed to track the optimization of the RLI bootstrap path. A 
patch is forthcoming.
   
   ## Stacktrace
   
   N/A (performance issue, not a crash).
   

