[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sugamber updated HUDI-1668:
---------------------------
    Description: 
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process takes nearly 2 hours to complete. While looking at the job log, I noticed that [sortBy at GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1] runs twice.

It is first triggered as its own job; *refer to this screenshot -> [^1st.png]*. The second time it is triggered from the *[count at HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]* step. In both cases the same number of jobs was triggered, and the running times are close to each other. *Refer to this screenshot* -> [^2nd.png]

Is there a way to run the sort only once so that the data can be loaded faster?

*Spark and Hudi versions*
{code:java}
Spark - 2.3.0
Scala - 2.11.12
Hudi - 0.7.0
{code}

Hudi configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2
"hoodie.bulkinsert.shuffle.parallelism" = 2000
"hoodie.parquet.small.file.limit" = 100000000
"hoodie.parquet.max.file.size" = 128000000
"hoodie.index.bloom.num_entries" = 1800000
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"
"hoodie.bloom.index.filter.dynamic.max.entries" = 2500000
"hoodie.bloom.index.bucketized.checking" = "false"
"hoodie.datasource.write.operation" = "bulk_insert"
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}

Spark configuration
{code:java}
--num-executors 180
--executor-cores 4
--executor-memory 16g
--driver-memory=24g
--conf spark.rdd.compress=true
--queue=default
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600
--conf spark.driver.memoryOverhead=1200
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m
{code}
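For context, the pattern above is consistent with how Spark's sortBy works, rather than the same work necessarily being scheduled twice by Hudi: sortBy is backed by a RangePartitioner, which eagerly runs a sampling job over the whole input to compute partition boundaries (that job is labelled with the sortBy call site), and only a later action such as the count at HoodieSparkSqlWriter.scala:433 runs the actual sort shuffle and write. Both passes scan the full input, which would explain the similar task counts and durations. The sketch below is plain Spark (not Hudi internals), and all names in it are made up for illustration only:
{code:scala}
import org.apache.spark.sql.SparkSession

object SortBySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sortby-sampling-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the records that bulk_insert sorts globally.
    val records = sc.parallelize(1 to 1000000, numSlices = 200).map(i => (i % 1000, i))

    // Defining the global sort already launches a job: RangePartitioner
    // samples every input partition to compute the range boundaries.
    // In the Spark UI this job carries the sortBy call site
    // (for Hudi, "sortBy at GlobalSortPartitioner.java:41").
    val sorted = records.sortBy(_._1, ascending = true, numPartitions = 200)

    // The actual sort shuffle only runs when an action is executed; in the
    // Hudi writer that action is the count at HoodieSparkSqlWriter.scala:433.
    println(sorted.count())

    spark.stop()
  }
}
{code}
If that is what is happening here, the second job is not a wasted duplicate of the first; the cost that is paid twice is reading the source, so persisting the input before the write (if it fits in memory/disk) is one hedged way to reduce the end-to-end time.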
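For completeness, a minimal sketch of how the Hudi options listed above are typically passed through the datasource writer and submitted with the spark-submit flags below. The table name, record key, precombine field, and paths are placeholders for this sketch, not values from this job:
{code:scala}
import org.apache.spark.sql.SaveMode

// Source data to be bulk inserted; path and schema are placeholders.
val df = spark.read.parquet("/path/to/source_data")

df.write
  .format("org.apache.hudi")
  // Options taken from the configuration listed above.
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.bulkinsert.shuffle.parallelism", "2000")
  .option("hoodie.cleaner.commits.retained", "2")
  .option("hoodie.parquet.small.file.limit", "100000000")
  .option("hoodie.parquet.max.file.size", "128000000")
  .option("hoodie.index.bloom.num_entries", "1800000")
  .option("hoodie.bloom.index.filter.type", "DYNAMIC_V0")
  .option("hoodie.bloom.index.filter.dynamic.max.entries", "2500000")
  .option("hoodie.bloom.index.bucketized.checking", "false")
  // The following identifiers are placeholders, not from this issue.
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Overwrite) // initial load in this sketch
  .save("/path/to/hudi/my_table")
{code}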
> GlobalSortPartitioner is getting called twice during bulk_insert.
> -----------------------------------------------------------------
>
>                 Key: HUDI-1668
>                 URL: https://issues.apache.org/jira/browse/HUDI-1668
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Sugamber
>            Priority: Minor
>         Attachments: 1st.png, 2nd.png
>
> Hi Team,
> I'm using the bulk insert option to load close to 2 TB of data. The process takes nearly 2 hours to complete. While looking at the job log, I noticed that [sortBy at GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1] runs twice.
> It is first triggered as its own job; *refer to this screenshot -> [^1st.png]*. The second time it is triggered from the *[count at HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]* step.
> In both cases the same number of jobs was triggered, and the running times are close to each other. *Refer to this screenshot* -> [^2nd.png]
> Is there a way to run the sort only once so that the data can be loaded faster?
> *Spark and Hudi versions*
> {code:java}
> Spark - 2.3.0
> Scala - 2.11.12
> Hudi - 0.7.0
> {code}
> Hudi configuration
> {code:java}
> "hoodie.cleaner.commits.retained" = 2
> "hoodie.bulkinsert.shuffle.parallelism" = 2000
> "hoodie.parquet.small.file.limit" = 100000000
> "hoodie.parquet.max.file.size" = 128000000
> "hoodie.index.bloom.num_entries" = 1800000
> "hoodie.bloom.index.filter.type" = "DYNAMIC_V0"
> "hoodie.bloom.index.filter.dynamic.max.entries" = 2500000
> "hoodie.bloom.index.bucketized.checking" = "false"
> "hoodie.datasource.write.operation" = "bulk_insert"
> "hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
> {code}
> Spark configuration
> {code:java}
> --num-executors 180
> --executor-cores 4
> --executor-memory 16g
> --driver-memory=24g
> --conf spark.rdd.compress=true
> --queue=default
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
> --conf spark.executor.memoryOverhead=1600
> --conf spark.driver.memoryOverhead=1200
> --conf spark.driver.maxResultSize=2g
> --conf spark.kryoserializer.buffer.max=512m
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)