Rap70r opened a new issue #3697: URL: https://github.com/apache/hudi/issues/3697
Hello,

We are using Spark and Hudi to upsert records into parquet in S3, extracted from Kafka, using EMR. The events could be either inserts or updates. Currently, it takes 41 minutes for the process to extract and upsert 1,430,000 records (1714 Megabytes). We are trying to increase the speed of this process. Below are the details of our environment.

**Environment Description**

* Hudi version : 0.9.0
* EMR version : 6.4.0
  * Master Instance: 1 r5.xlarge
  * Core Instance: 1 c5.xlarge
  * Task Instance: 25 c5.xlarge
* Spark version : 3.1.2
* Hive version : n/a
* Hadoop version : 3.2.1
* Source : Kafka
* Storage : S3 (as parquet)
* Partitions: 1100
* Partition Size: ~1MB to 30MB each
* Parallelism: 3000
* Operation: Upsert
* Partition : Concatenation of year, month and week of a date field
* Storage Type: COPY_ON_WRITE
* Running on Docker? : no

**Spark-Submit Configs**

`spark-submit --deploy-mode cluster --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=300s --conf spark.dynamicAllocation.executorIdleTimeout=300s --conf spark.scheduler.mode=FAIR --conf spark.memory.fraction=0.4 --conf spark.memory.storageFraction=0.1 --conf spark.shuffle.service.enabled=true --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.sql.parquet.mergeSchema=true --conf spark.driver.maxResultSize=4g --conf spark.driver.memory=4g --conf spark.executor.cores=4 --conf spark.driver.memoryOverhead=1g --conf spark.executor.instances=100 --conf spark.executor.memoryOverhead=1g --conf spark.driver.cores=4 --conf spark.executor.memory=4g --conf spark.rdd.compress=true --conf spark.kryoserializer.buffer.max=512m --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.yarn.nodemanager.vmem-check-enabled=false --conf yarn.nodemanager.pmem-check-enabled=false --conf spark.sql.shuffle.partitions=100 --conf spark.default.parallelism=100 --conf spark.task.cpus=2`

**Spark Job**

![image](https://user-images.githubusercontent.com/22181358/134231023-4aa94788-5f68-4610-843c-1e98187aa810.png)

From the job above, it seems that most of the time is consumed by the UpsertPartitioner and SparkUpsertCommitActionExecutor events. Do you have any suggestions on how to reduce the time the above job takes to complete?

Thank you
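
For reference, here is a rough back-of-the-envelope sketch in Python of what the numbers above imply. It is not part of the job itself. The helper names (`throughput`, `task_slots`, `waves`) are hypothetical and used only for illustration. The sketch assumes that Spark runs `spark.executor.cores / spark.task.cpus` tasks concurrently per executor, that all 100 requested executors are actually granted (with dynamic allocation enabled and a 26-node cluster the real number may be lower), and that the reported parallelism of 3000 corresponds to the number of tasks in a stage.

```python
import math

# Figures reported in the issue.
RECORDS = 1_430_000
SIZE_MB = 1714
ELAPSED_S = 41 * 60

# Values taken from the spark-submit configs above.
EXECUTOR_INSTANCES = 100  # spark.executor.instances
EXECUTOR_CORES = 4        # spark.executor.cores
TASK_CPUS = 2             # spark.task.cpus
PARALLELISM = 3000        # "Parallelism" from the environment description


def throughput(records, size_mb, elapsed_s):
    """Return (records per second, megabytes per second)."""
    return records / elapsed_s, size_mb / elapsed_s


def task_slots(executor_instances, executor_cores, task_cpus):
    """Upper bound on concurrently running tasks across all executors."""
    return executor_instances * (executor_cores // task_cpus)


def waves(num_tasks, slots):
    """Number of scheduling rounds needed to run num_tasks on the given slots."""
    return math.ceil(num_tasks / slots)


rec_per_s, mb_per_s = throughput(RECORDS, SIZE_MB, ELAPSED_S)
slots = task_slots(EXECUTOR_INSTANCES, EXECUTOR_CORES, TASK_CPUS)

print(f"records/s: {rec_per_s:.0f}")
print(f"MB/s: {mb_per_s:.2f}")
print(f"concurrent task slots (upper bound): {slots}")
print(f"waves for {PARALLELISM} tasks: {waves(PARALLELISM, slots)}")
```

Under these assumptions, `spark.task.cpus=2` halves the number of tasks each 4-core executor can run at once, so a stage with 3000 tasks needs many scheduling rounds even in the best case.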