[I] Benchmark MDT performance for diff sizes of datasets [hudi]

via GitHub Sat, 29 Nov 2025 22:14:46 -0800


hudi-bot opened a new issue, #15646:
URL: https://github.com/apache/hudi/issues/15646


   Our benchmarking should cover different flavors and scenarios.
    
   Size here refers to the total number of files in the table, not the data 
size in GB.
    
   Two metadata table (MDT) states to test: {fully compacted MDT, MDT with a 
compaction + a few log files}
    * for a small dataset, MDT is close to direct FS.
    * for a medium dataset, MDT is moderately better than direct FS.
    * for large scale, MDT is much faster than direct FS.
    * for xlarge scale, MDT is far faster than direct FS.
   
    
   Small:
   100 partitions with 10 files in each. (1k total files)
    
   Medium:
   100 partitions w/ 100 files in each. (10k total files)
    
   Large:
   1000 partitions w/ 1000 files in each. (1M total files)
    
   XLarge:
   50k partitions w/ 10k files in each. (500M total files)
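
The size tiers and the two MDT states together define the benchmark matrix. A minimal Python sketch of that matrix follows. The tier numbers come from the list above; the names `TIERS`, `MDT_STATES`, `total_files`, and `scenarios` are illustrative and not part of Hudi.

```python
from itertools import product

# Tier name -> (partitions, files per partition), as listed in this issue.
TIERS = {
    "small": (100, 10),
    "medium": (100, 100),
    "large": (1_000, 1_000),
    "xlarge": (50_000, 10_000),
}

# The two MDT states to test. The label strings are illustrative only.
MDT_STATES = ("fully_compacted", "compaction_plus_few_log_files")


def total_files(tier: str) -> int:
    """Return the total file count for a tier (partitions * files per partition)."""
    partitions, files_per_partition = TIERS[tier]
    return partitions * files_per_partition


def scenarios():
    """Yield one (tier, mdt_state, total_files) tuple per benchmark run."""
    for tier, state in product(TIERS, MDT_STATES):
        yield tier, state, total_files(tier)


if __name__ == "__main__":
    for tier, state, files in scenarios():
        print(f"{tier:<7} {state:<30} {files:>12,} files")
```

This gives 4 tiers x 2 MDT states = 8 runs, each of which would also be compared against direct FS listing.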
    
   What calls to measure:
   Let's measure latency for:
    1. getAllPartitions.
    2. getAllFiles for a given partition, for a random 5% of partitions.
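
A rough sketch of how the two measurements could be driven, written in Python for brevity. It assumes the caller supplies two callables, `list_partitions` and `list_files`, standing in for getAllPartitions and getAllFiles against either the MDT or direct FS. Those callables, the fixed seed, and the reported metrics are assumptions for illustration, not Hudi APIs.

```python
import random
import time
from typing import Callable, List, Sequence


def sample_partitions(partitions: Sequence[str], fraction: float = 0.05,
                      seed: int = 42) -> List[str]:
    """Pick a reproducible random subset of partitions (at least one if any exist)."""
    if not partitions:
        return []
    count = max(1, round(len(partitions) * fraction))
    return random.Random(seed).sample(list(partitions), count)


def timed(fn: Callable, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_benchmark(list_partitions: Callable[[], Sequence[str]],
                  list_files: Callable[[str], Sequence[str]],
                  fraction: float = 0.05, seed: int = 42) -> dict:
    """Time one partition listing, then file listings for a sampled subset."""
    partitions, partitions_latency = timed(list_partitions)
    sampled = sample_partitions(partitions, fraction, seed)
    file_latencies = [timed(list_files, p)[1] for p in sampled]
    mean = sum(file_latencies) / len(file_latencies) if file_latencies else 0.0
    return {
        "get_all_partitions_s": partitions_latency,
        "sampled_partitions": len(sampled),
        "get_all_files_mean_s": mean,
        "get_all_files_max_s": max(file_latencies, default=0.0),
    }


if __name__ == "__main__":
    # In-memory stand-in for the "small" tier: 100 partitions, 10 files each.
    table = {f"p{i:03d}": [f"f{j}.parquet" for j in range(10)] for i in range(100)}
    print(run_benchmark(lambda: list(table), lambda p: table[p]))
```

Using the same seed for the MDT run and the direct FS run keeps the sampled partitions identical, so the two latencies are comparable.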
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-5440
   - Type: Improvement
   - Epic: https://issues.apache.org/jira/browse/HUDI-1292
   - Fix version(s):
     - 1.1.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
