xushiyan commented on issue #3697:
URL: https://github.com/apache/hudi/issues/3697#issuecomment-927248813


   From the parameters you show, I see the problem is mostly caused by not utilizing the machines' resources efficiently. Let's do some math: say you use 4 m5.4xlarge machines, each with 16 cores and 64g of memory. With 3 cores, 6g of memory, and 2g of overhead per JVM, each machine fits 5 JVMs (cores are the limiting factor: 16 / 3 ≈ 5, and 5 × 8g = 40g is well under 64g), so the cluster fits 20 JVMs in total, i.e. 1 driver plus 19 executors.

   Setting the configs below should let you run 19 executors and 1 driver for the Spark job. Double-check the Spark UI to confirm how many executors you're actually getting.
   
   ```
   spark.driver.cores=3
   spark.driver.memory=6g
   spark.driver.memoryOverhead=2g
   spark.executor.cores=3
   spark.executor.memory=6g
   spark.executor.memoryOverhead=2g
   spark.executor.instances=19
   spark.sql.shuffle.partitions=200
   spark.default.parallelism=200
   spark.task.cpus=1
   ```
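   
   Besides the Spark UI, you can sanity-check the allocation programmatically. A minimal sketch, assuming a `spark` session is in scope (e.g. in spark-shell); note that the status tracker typically counts the driver as well, so expect roughly 20 entries with the confs above:
   
   ```scala
   // Report the JVMs currently known to the application.
   // SparkStatusTracker.getExecutorInfos generally includes the driver entry.
   val allocated = spark.sparkContext.statusTracker.getExecutorInfos.length
   println(s"JVMs allocated (driver + executors): $allocated")
   ```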
   
   Also set these Hudi props in your Spark writer options:
   
   ```
     "hoodie.upsert.shuffle.parallelism" = 200,
     "hoodie.insert.shuffle.parallelism" = 200,
     "hoodie.finalize.write.parallelism" = 200,
     "hoodie.bulkinsert.shuffle.parallelism" = 200,
   ```
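   
   A minimal sketch of passing those props to the Hudi writer; `inputDf`, the table name, and the target path are placeholders for your own dataset:
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   // Upsert with the parallelism props above; adjust the placeholder
   // table name and target path to your setup.
   inputDf.write
     .format("hudi")
     .option("hoodie.table.name", "my_table")                    // placeholder
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.upsert.shuffle.parallelism", "200")
     .option("hoodie.insert.shuffle.parallelism", "200")
     .option("hoodie.finalize.write.parallelism", "200")
     .option("hoodie.bulkinsert.shuffle.parallelism", "200")
     .mode(SaveMode.Append)
     .save("s3://my-bucket/my_table")                            // placeholder
   ```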
   
   Also, you don't need to construct `hoodie_key` and `hoodie_partition` yourself; set the Hudi key generator class options properly in the Spark writer options instead. Refer to [this blog](https://hudi.incubator.apache.org/blog/2021/02/13/hudi-key-generators/).
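   
   For example, a sketch of the key-related writer options; the column names `id` and `event_date` are placeholders for your actual record key and partition columns:
   
   ```scala
   // Let Hudi derive the record key and partition path from existing columns
   // instead of precomputing hoodie_key / hoodie_partition yourself.
   val keyOpts = Map(
     "hoodie.datasource.write.recordkey.field"     -> "id",
     "hoodie.datasource.write.partitionpath.field" -> "event_date",
     "hoodie.datasource.write.keygenerator.class"  ->
       "org.apache.hudi.keygen.SimpleKeyGenerator"
   )
   // Merge into the writer, e.g. inputDf.write.format("hudi").options(keyOpts)...
   ```
   
   If the key or partition path is built from multiple columns, `org.apache.hudi.keygen.ComplexKeyGenerator` takes comma-separated field lists instead.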

