chandu-1101 opened a new issue, #9329:
URL: https://github.com/apache/hudi/issues/9329

   Issue:
   
   1. I have a 39 GB Parquet file on S3 that is ingested into Apache Hudi. It 
is Snappy-compressed.
   2. I have ~147 GB of JSON files on S3 representing CDC data from MongoDB.
   3. Each row in the JSON files is ~5-6 KB (kilobytes).
   4. When I try to merge them using Spark SQL, the job completes in about 
9,000 vcore-seconds. (This runs every day in our production; the SQL runs on 
an 8-10 node EMR cluster of m5.2xlarge instances, 8 cores and 32 GB RAM each, 
with ~8-10 executors spun up on average.) The same job, when run with Apache 
Hudi, takes approximately 22,000-30,000 vcore-seconds. A sketch of the 
baseline merge is below.
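   For reference, a minimal sketch of the kind of Spark SQL merge being timed; 
the bucket paths, column names (`_id`, `ts`, `payload`), and dedup logic are 
hypothetical placeholders, not our actual pipeline:
   
   ```scala
   // Baseline merge in plain Spark; paths and columns are placeholders.
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().appName("cdc-merge-baseline").getOrCreate()
   
   // Base snapshot: the 39 GB Snappy-compressed Parquet data on S3.
   spark.read.parquet("s3://bucket/base/").createOrReplaceTempView("base")
   
   // CDC feed: ~147 GB of JSON from MongoDB, ~5-6 KB per row.
   spark.read.json("s3://bucket/cdc/").createOrReplaceTempView("cdc")
   
   // Keep the latest CDC row per key, then overlay it on the base snapshot.
   val merged = spark.sql("""
     WITH latest_cdc AS (
       SELECT * FROM (
         SELECT *, ROW_NUMBER() OVER (PARTITION BY _id ORDER BY ts DESC) AS rn
         FROM cdc
       ) WHERE rn = 1
     )
     SELECT COALESCE(c._id, b._id) AS _id,
            COALESCE(c.payload, b.payload) AS payload
     FROM base b
     FULL OUTER JOIN latest_cdc c ON b._id = c._id
   """)
   
   merged.write.mode("overwrite").parquet("s3://bucket/base_merged/")
   ```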
   
   
   I remember reading that Hudi is built around LSM trees, and the way I 
understand Hudi's copy-on-write (COW) is:
   1. Hudi takes a batch of updates aimed at a Parquet file, then copies the 
old file into a new one with the updates applied (roughly the write path 
sketched after this list).
   2. Essentially this mechanism, in my view, should be much faster than a 
plain Spark SQL merge. However, that is not the case.
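   For comparison, the Hudi path looks roughly like this; the record key, 
precombine field, table name, and paths are again hypothetical:
   
   ```scala
   // Equivalent COW upsert through Hudi's Spark datasource.
   // `spark` is the session from the sketch above; keys and paths are placeholders.
   import org.apache.spark.sql.SaveMode
   
   val cdc = spark.read.json("s3://bucket/cdc/")
   
   cdc.write.format("hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.datasource.write.recordkey.field", "_id") // MongoDB key
     .option("hoodie.datasource.write.precombine.field", "ts") // latest row wins
     .mode(SaveMode.Append)
     .save("s3://bucket/hudi_table/")
   ```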
   
   One suggestion I got is to partition the data, but that requires a full 
pipeline change. Is there any way to get this below the Spark SQL 
vcore-seconds? The only write-side knobs I can see, short of repartitioning, 
are sketched below.
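   For concreteness, these are the sorts of configs I would expect to tune; 
the values are illustrative guesses, not tested settings:
   
   ```scala
   // Write-side knobs that don't require repartitioning the pipeline;
   // values are illustrative guesses. `cdc` is the DataFrame from above.
   val tuning = Map(
     "hoodie.upsert.shuffle.parallelism" -> "1500",  // roughly match total cores
     "hoodie.index.type"                 -> "BLOOM", // default; SIMPLE is an alternative
     "hoodie.parquet.max.file.size"      -> (512L * 1024 * 1024).toString,
     "hoodie.parquet.small.file.limit"   -> (100L * 1024 * 1024).toString
   )
   
   cdc.write.format("hudi")
     .options(tuning) // in addition to the options shown earlier
     .mode(SaveMode.Append)
     .save("s3://bucket/hudi_table/")
   ```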
   
   

