[ https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433587#comment-17433587 ]
Raymond Xu commented on HUDI-1970:
----------------------------------

* 1B records (randomized values in the example trip model)
* 100 partitions, evenly distributed, year=*/month=*/day=*, 50 parquet files / partition
* EMR 6.2, Spark 3.0.1-amzn-0
* S3, parquet compression snappy
* hudi: 109.8 GB = 22.4 MB parquet x 5000
* delta: 70.9 GB = 14.5 MB parquet x 5000

|SQL|Hudi 0.9.0|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0|129.352|108.312|104.914|
|select count(*) from hudi_trips_snapshot|96.001|83.839|66.973|
|select count(*) from hudi_trips_snapshot where year = '2020' and month = '03' and day = '01'|1.880|1.776|1.767|
|select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where year='2020' and month='03' and day='01' and fare between 20 and 50|3.650|3.147|3.086|

> Performance testing/certification of key SQL DMLs
> -------------------------------------------------
>
>                 Key: HUDI-1970
>                 URL: https://issues.apache.org/jira/browse/HUDI-1970
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Performance, Spark Integration
>            Reporter: Vinoth Chandar
>            Assignee: Raymond Xu
>            Priority: Blocker
>             Fix For: 0.10.0
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
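
The dataset figures quoted in the comment (100 partitions, 50 parquet files per partition, 109.8 GB for hudi, 70.9 GB for delta) can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes 1 GB = 1024 MB, which the comment does not state.

```python
# Figures quoted in the comment.
PARTITIONS = 100
FILES_PER_PARTITION = 50
TOTAL_GB = {"hudi": 109.8, "delta": 70.9}

# Assumption: 1 GB = 1024 MB (the comment does not state the unit convention).
MB_PER_GB = 1024


def total_files(partitions: int, files_per_partition: int) -> int:
    """Number of parquet files across all partitions."""
    return partitions * files_per_partition


def avg_file_mb(total_gb: float, n_files: int, mb_per_gb: int = MB_PER_GB) -> float:
    """Average parquet file size in MB implied by a table's total size."""
    return total_gb * mb_per_gb / n_files


if __name__ == "__main__":
    n = total_files(PARTITIONS, FILES_PER_PARTITION)
    print(f"files: {n}")
    for name, gb in TOTAL_GB.items():
        print(f"{name}: {avg_file_mb(gb, n):.2f} MB per file")
    # Storage footprint of the hudi table relative to the delta table.
    ratio = TOTAL_GB["hudi"] / TOTAL_GB["delta"]
    print(f"hudi/delta size ratio: {ratio:.2f}")
```

Under that unit assumption the implied averages land within about 0.1 MB of the 22.4 MB and 14.5 MB per-file sizes quoted above, and the file count matches the stated 5000.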