[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351095#comment-17351095 ]
Sugamber commented on HUDI-1668:
--------------------------------

[~nishith29] Yes, we can close this. Thank you!

> GlobalSortPartitioner is getting called twice during bulk_insert.
> -----------------------------------------------------------------
>
>                 Key: HUDI-1668
>                 URL: https://issues.apache.org/jira/browse/HUDI-1668
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Sugamber
>            Assignee: Nishith Agarwal
>            Priority: Minor
>              Labels: sev:high, user-support-issues
>         Attachments: 1st.png, 2nd.png, Screen Shot 2021-04-17 at 11.23.17 AM.png, Screenshot 2021-04-21 at 6.40.19 PM.png, Screenshot 2021-04-21 at 6.40.40 PM.png
>
>
> Hi Team,
> I'm using the bulk insert option to load close to 2 TB of data. The process takes nearly two hours to complete. While looking at the job log, I noticed that [sortBy at GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1] runs twice.
> It is first triggered in one stage. *Refer to this screenshot -> [^1st.png]*. The second time it is triggered from the *[count at HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]* step.
> In both cases the same number of jobs was triggered, and the running times are close to each other. *Refer to this screenshot* -> [^2nd.png]
> Is there any way to run the sort only once so that the data can be loaded faster, or is this expected behaviour?
> *Spark and Hudi versions*
> {code:java}
> Spark - 2.3.0
> Scala - 2.11.12
> Hudi - 0.7.0
> {code}
> Hudi configuration:
> {code:java}
> "hoodie.cleaner.commits.retained" = 2
> "hoodie.bulkinsert.shuffle.parallelism" = 2000
> "hoodie.parquet.small.file.limit" = 100000000
> "hoodie.parquet.max.file.size" = 128000000
> "hoodie.index.bloom.num_entries" = 1800000
> "hoodie.bloom.index.filter.type" = "DYNAMIC_V0"
> "hoodie.bloom.index.filter.dynamic.max.entries" = 2500000
> "hoodie.bloom.index.bucketized.checking" = "false"
> "hoodie.datasource.write.operation" = "bulk_insert"
> "hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
> {code}
> Spark configuration:
> {code:java}
> --num-executors 180
> --executor-cores 4
> --executor-memory 16g
> --driver-memory=24g
> --conf spark.rdd.compress=true
> --queue=default
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
> --conf spark.executor.memoryOverhead=1600
> --conf spark.driver.memoryOverhead=1200
> --conf spark.driver.maxResultSize=2g
> --conf spark.kryoserializer.buffer.max=512m
> {code}
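A note for anyone who hits the same duplicated stage: this looks like the standard Spark lineage-recomputation pattern. The bulk_insert output RDD ends in a global sort, and the count at HoodieSparkSqlWriter.scala:433 is a second action on that same lineage, so if the RDD is not persisted in between, Spark re-executes everything upstream, including the sort. The sketch below reproduces the pattern in plain Spark; it is illustrative only (the object name, data volume, and storage level are assumptions), not Hudi's internal code.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Illustrative sketch only -- not Hudi's internal code. Shows why two
// actions on an unpersisted RDD whose lineage ends in sortBy run the
// sort twice, matching the duplicated "sortBy at
// GlobalSortPartitioner.java:41" stage in this ticket.
object SortRecomputeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-recompute-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A lineage that ends in a global sort, analogous to the
    // bulk_insert sortBy stage.
    val sorted = sc.parallelize(1 to 1000000)
      .map(i => (i % 1000, i))
      .sortBy(_._1, ascending = true, numPartitions = 20)

    println(sorted.count()) // 1st action: runs the full lineage + sort
    println(sorted.count()) // 2nd action: re-runs the sort from scratch

    // Persisting after the sort lets later actions reuse the
    // materialized partitions instead of re-sorting.
    sorted.persist(StorageLevel.MEMORY_AND_DISK_SER)
    println(sorted.count()) // sorts one more time, then caches
    println(sorted.count()) // served from the persisted partitions

    spark.stop()
  }
}
{code}

Since both actions here happen inside Hudi's writer rather than in user code, there does not appear to be a 0.7.0 configuration that suppresses the second run from the application side; persisting the RDD that the count consumes inside HoodieSparkSqlWriter is presumably the kind of change this ticket tracked.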
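Separately, for readers copying the configuration above, here is a hedged sketch of how those options would typically be wired into a bulk_insert write through the DataFrame API (the DataFrame, table name, key/precombine columns, and output path are all placeholders):

{code:scala}
import org.apache.spark.sql.{DataFrame, SaveMode}

// Illustrative wiring of the ticket's options into a bulk_insert
// write; `df`, the table name, the "id"/"ts" columns, and `basePath`
// are placeholders, not values from this ticket.
def bulkInsert(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi") // the long form "org.apache.hudi" also works
    .option("hoodie.table.name", "example_table")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
    .option("hoodie.bulkinsert.shuffle.parallelism", "2000")
    .option("hoodie.parquet.small.file.limit", "100000000")
    .option("hoodie.parquet.max.file.size", "128000000")
    .mode(SaveMode.Append)
    .save(basePath)
}
{code}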