Sugamber created HUDI-1668:
------------------------------

             Summary: GlobalSortPartitioner is getting called twice during bulk_insert
                 Key: HUDI-1668
                 URL: https://issues.apache.org/jira/browse/HUDI-1668
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Sugamber
         Attachments: 1st.png, 2nd.png
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process takes nearly 2 hours to complete. While looking at the job log, I noticed that [sortBy at GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us-walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1] is running twice. It is first triggered at stage 1 (refer to the first screenshot). The second time it is triggered from the *HoodieSparkSqlWriter.scala:433* step: *[count at HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us-walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*. In both cases the same number of jobs were triggered, and their running times are close to each other. Is there any way to run the sort only once so that the data can be loaded faster?

*Spark and Hudi configurations*
{code:java}
Spark - 2.3.0
Scala - 2.11.12
Hudi  - 0.7.0
{code}

Hudi configuration:
{code:java}
"hoodie.cleaner.commits.retained" = 2
"hoodie.bulkinsert.shuffle.parallelism" = 2000
"hoodie.parquet.small.file.limit" = 100000000
"hoodie.parquet.max.file.size" = 128000000
"hoodie.index.bloom.num_entries" = 1800000
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"
"hoodie.bloom.index.filter.dynamic.max.entries" = 2500000
"hoodie.bloom.index.bucketized.checking" = "false"
"hoodie.datasource.write.operation" = "bulk_insert"
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}

Spark configuration:
{code:java}
--num-executors 180 --executor-cores 4 --executor-memory 16g --driver-memory=24g --conf spark.rdd.compress=true --queue=default --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.executor.memoryOverhead=1600 --conf spark.driver.memoryOverhead=1200 --conf spark.driver.maxResultSize=2g --conf spark.kryoserializer.buffer.max=512m
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
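A possible explanation for the observation above, sketched below without confirming against Hudi's internal code path: in Spark, each action (such as {{count}}) re-evaluates the full lineage of an un-persisted RDD, so a {{sortBy}} upstream of two actions runs twice unless its result is persisted. The {{records}} RDD and the key extractor in this sketch are illustrative placeholders, not names from Hudi.

{code:scala}
import org.apache.spark.storage.StorageLevel

// Illustrative placeholder: some RDD of (key, payload) records.
// Parallelism of 2000 mirrors hoodie.bulkinsert.shuffle.parallelism.
val sorted = records.sortBy(r => r._1, ascending = true, numPartitions = 2000)

// Without persist(), BOTH actions below re-run the sort shuffle.
val persisted = sorted.persist(StorageLevel.MEMORY_AND_DISK_SER)

val total = persisted.count()            // first action: sort runs once, partitions cached
persisted.saveAsTextFile("/tmp/sorted")  // second action: reuses the cached partitions
{code}

If the second sort really is a recomputation of the same lineage rather than an independent job, persisting (or checkpointing) the sorted data between the two actions would be the usual Spark-level remedy.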