chandu-1101 opened a new issue, #9329: URL: https://github.com/apache/hudi/issues/9329
Issue:

1. I have a 39 GB Parquet file on S3 which is ingested into Apache Hudi. It is Snappy-compressed.
2. I have 147 GB of JSON files on S3 representing CDC data from MongoDB.
3. Each row in the JSON files is roughly 5-6 KB (kilobytes).

When I try to merge them using Spark SQL, I see it complete in about 9,000 vcore-seconds. (This runs every day in our production. The SQL runs on an 8-10 node EMR cluster: 10 x m5.2xlarge, 8 cores / 32 GB RAM each; ~8-10 executors are spun up on average.) The same job, when run with Apache Hudi, takes roughly 22,000-30,000 vcore-seconds.

I remember reading that Hudi is built around LSM trees, and the way I understand Hudi's copy-on-write (COW) is:

1. Hudi collects a batch of updates aimed at a Parquet file.
2. It then copies the old Parquet file into a new one with those updates applied.

In my view, this mechanism should be much, much faster than Spark SQL. However, that is not the case. One suggestion I got is to partition the data, but that requires a full pipeline change. Are there any ways of getting this below the Spark SQL vcore-seconds?
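The copy-on-write behavior described above can be sketched with a toy model (a plain-Python illustration, not Hudi's actual implementation; the function and field names here are hypothetical). The point it shows is why COW writes can be expensive on a large, unpartitioned table: even a small update batch forces the entire affected file to be read and rewritten, so write cost scales with the size of the touched base files rather than the size of the CDC batch.

```python
def cow_rewrite(base_records, updates, key="_id"):
    """Toy copy-on-write merge: return a *new* file's records,
    built by reading the whole old file and applying updates by key."""
    # Read every record of the old "file", not just the updated ones.
    merged = {r[key]: r for r in base_records}
    # Apply the (possibly tiny) CDC batch on top.
    for u in updates:
        merged[u[key]] = {**merged.get(u[key], {}), **u}
    # Write out the whole new "file".
    return list(merged.values())

base = [{"_id": i, "v": 0} for i in range(5)]
new_file = cow_rewrite(base, [{"_id": 2, "v": 99}])
# A 1-record update still produced a full 5-record rewrite.
```

With a single 39 GB base file, every daily batch rewrites the full 39 GB, which is one reason partitioning (or smaller file groups) is the usual suggestion despite the pipeline change it implies.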