[ https://issues.apache.org/jira/browse/HUDI-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan updated HUDI-5440:
--------------------------------------
    Fix Version/s: 0.13.0

> Benchmark MDT performance for diff sizes of datasets
> ----------------------------------------------------
>
>                 Key: HUDI-5440
>                 URL: https://issues.apache.org/jira/browse/HUDI-5440
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: metadata
>            Reporter: sivabalan narayanan
>            Priority: Critical
>             Fix For: 0.13.0
>
>
> Our benchmarking should span different flavors and scenarios.
>
> Size here refers to the total number of files in the table, not the actual
> size in GB.
>
> Two different MDT states to test: {fully compacted MDT, MDT with a
> compaction plus a few log files}
>  * For a small dataset, MDT is close to direct FS.
>  * For a medium dataset, MDT is moderately better than direct FS.
>  * For a large dataset, MDT is much faster than direct FS.
>  * For an xlarge dataset, MDT is much, much faster than direct FS.
>
> Small:
> 100 partitions w/ 10 files in each. (1k total files)
>
> Medium:
> 100 partitions w/ 100 files in each. (10k total files)
>
> Large:
> 1000 partitions w/ 1000 files in each. (1M total files)
>
> XLarge:
> 50k partitions w/ 10k files in each. (500M total files)
>
> What calls to measure:
> Let's measure latency for:
>  # getAllPartitions.
>  # getAllFiles for a given partition, for a random 5% of partitions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
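
The measurement plan in the issue could be sketched roughly as follows. This is a minimal illustration, not Hudi code: `Scenario`, `sample_partitions`, `timed`, and `benchmark` are hypothetical names, and the two callables passed to `benchmark` are stand-ins for whatever listing path is under test (MDT-backed or direct FS). It encodes the four dataset sizes, picks a reproducible random 5% of partitions, and times the two calls named in the issue.

```python
import random
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One dataset size from the issue (hypothetical helper type)."""
    name: str
    partitions: int
    files_per_partition: int

    @property
    def total_files(self) -> int:
        return self.partitions * self.files_per_partition


# Sizes as listed in the issue: partitions x files per partition.
SCENARIOS = [
    Scenario("small", 100, 10),
    Scenario("medium", 100, 100),
    Scenario("large", 1_000, 1_000),
    Scenario("xlarge", 50_000, 10_000),
]


def sample_partitions(partitions, percent=5, seed=42):
    """Pick a reproducible random subset of partitions (at least one)."""
    count = max(1, len(partitions) * percent // 100)
    return random.Random(seed).sample(partitions, count)


def timed(fn, *args):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def benchmark(get_all_partitions, get_all_files, percent=5, seed=42):
    """Time one partition listing, then file listings for sampled partitions.

    Both callables are assumptions supplied by the caller; they are not
    Hudi APIs.
    """
    partitions, partitions_latency = timed(get_all_partitions)
    sampled = sample_partitions(list(partitions), percent, seed)
    file_latencies = [timed(get_all_files, p)[1] for p in sampled]
    return {
        "partitions_latency_s": partitions_latency,
        "sampled_partitions": len(sampled),
        "avg_files_latency_s": sum(file_latencies) / len(file_latencies),
    }
```

Running `benchmark` once with MDT-backed callables and once with direct-FS callables, for each scenario and each of the two MDT states, would give the comparison the issue asks for. A fixed seed keeps the sampled 5% identical across both runs.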