linliu-code opened a new issue, #18861:
URL: https://github.com/apache/hudi/issues/18861

   ### Describe the problem
   
   During compaction and clustering **plan scheduling**, the planner collects 
eligible file slices one partition at a time: each partition is processed as an 
independent task that issues its own file listing. On metadata-table (MDT) 
backed tables, each of those listings is a separate read against the MDT 
`files` partition.
   
   For tables with a large number of partitions (N) this becomes **O(N) 
independent metadata reads**. The reads run only as wide as the available 
executor cores, and each task is dominated by per-read I/O latency while using 
negligible CPU. As a result, plan-generation latency grows roughly linearly 
with partition count and can dominate the scheduling phase for partition-heavy 
MoR/streaming tables.
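
The linear growth described above can be illustrated with a rough back-of-the-envelope model. This is only a sketch, not a measurement: it assumes listings run in waves as wide as the executor core count and that each wave costs about one read latency. The function name and all numbers are hypothetical.

```python
import math


def estimated_listing_latency_ms(num_partitions, executor_cores, read_latency_ms):
    """Rough model (assumption): per-partition reads run in waves of
    `executor_cores`, and each wave costs one per-read I/O latency."""
    waves = math.ceil(num_partitions / executor_cores)
    return waves * read_latency_ms


# Hypothetical inputs: 100 executor cores, 20 ms per metadata read.
for n in (1_000, 10_000, 100_000):
    print(n, estimated_listing_latency_ms(n, executor_cores=100, read_latency_ms=20))
```

Under these made-up inputs, 10,000 partitions need 100 waves, so the estimate is 2,000 ms, and it grows tenfold with each tenfold increase in partition count.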
   
   ### Why it's avoidable
   
   The metadata table already supports fetching files for many partitions in a 
**single batched read**, and the file-system view already exposes a partition 
pre-load entry point — the plan generators simply don't use them. There's also 
a **latent** case where the filesystem-backed metadata lists partitions 
sequentially on the driver.
   
   ### Proposal
   
   Pre-load all required partitions in one batched metadata read before 
building the plan, so plan generation issues a single read instead of N. Gate 
this on metadata-table availability so non-MDT tables keep today's 
fully-distributed listing path, and parallelize the sequential 
filesystem-backed listing. The produced plan is unchanged.
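
The intent of the proposal can be sketched with a toy model. This is not Hudi code: `ToyMetadataTable`, `plan_per_partition`, and `plan_with_preload` are hypothetical names, and the "plan" is simply a mapping from partition to files. The sketch only shows the read-count difference and the gating on metadata-table availability.

```python
class ToyMetadataTable:
    """Hypothetical stand-in for the MDT `files` partition; counts reads."""

    def __init__(self, files_by_partition):
        self._files = dict(files_by_partition)
        self.read_count = 0

    def list_files(self, partition):
        # One metadata read per partition (today's behaviour in this model).
        self.read_count += 1
        return list(self._files.get(partition, []))

    def list_files_batched(self, partitions):
        # One metadata read covering many partitions.
        self.read_count += 1
        return {p: list(self._files.get(p, [])) for p in partitions}


def plan_per_partition(mdt, partitions):
    """Collect files one partition at a time: N reads for N partitions."""
    return {p: mdt.list_files(p) for p in partitions}


def plan_with_preload(mdt, partitions, metadata_table_enabled):
    """Pre-load all partitions in one batched read when the MDT is available;
    otherwise fall back to the per-partition path."""
    if not metadata_table_enabled:
        return plan_per_partition(mdt, partitions)
    preloaded = mdt.list_files_batched(partitions)
    return {p: preloaded[p] for p in partitions}


# Illustrative data only: 30 partitions with 3 files each.
partitions = [f"dt=2024-01-{d:02d}" for d in range(1, 31)]
files = {p: [f"{p}/file-{i}.parquet" for i in range(3)] for p in partitions}

baseline = ToyMetadataTable(files)
batched = ToyMetadataTable(files)
plan_a = plan_per_partition(baseline, partitions)
plan_b = plan_with_preload(batched, partitions, metadata_table_enabled=True)
print(baseline.read_count, batched.read_count, plan_a == plan_b)
```

In this model both paths produce the same plan, while the pre-load path issues one read instead of one per partition. When `metadata_table_enabled` is false, the function keeps the per-partition path unchanged.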
   
   ### Impact
   
   Lower, partition-count-independent plan-scheduling latency on a hot path 
exercised by every MoR/streaming deployment.
   

