[jira] [Commented] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351095#comment-17351095 ] Sugamber commented on HUDI-1668: [~nishith29] Yes, we can close this. Thank you!!!
[jira] [Comment Edited] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326525#comment-17326525 ] Sugamber edited comment on HUDI-1668 at 4/21/21, 1:14 PM: I've attached both screenshots. !Screenshot 2021-04-21 at 6.40.19 PM.png! !Screenshot 2021-04-21 at 6.40.40 PM.png!
[jira] [Commented] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326526#comment-17326526 ] Sugamber commented on HUDI-1668: [~shivnarayan] I see the global sort executed twice in this example.
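The behaviour reported here is consistent with plain Spark lazy evaluation: each action re-walks the RDD lineage, so an un-persisted sortBy can end up being executed once per action. The sketch below is a minimal, generic illustration under that assumption; it is not Hudi code, and the object and variable names are made up.

{code:scala}
// Minimal, generic Spark illustration (not Hudi code). An RDD built with
// sortBy is evaluated lazily, so each action re-walks its lineage; without
// persist(), the work behind the sort can be repeated, which is one way the
// same "sortBy at ..." stage shows up more than once in the Spark UI.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SortRecomputeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-recompute-sketch")
      .master("local[*]")
      .getOrCreate()

    val records = spark.sparkContext
      .parallelize(1 to 1000000)
      .map(i => (i % 97, i))

    // Lazy transformation: nothing executes yet.
    val sorted = records.sortBy(_._1, ascending = true, numPartitions = 20)

    // Two separate actions on the un-persisted RDD; each triggers its own
    // job over the sorted lineage.
    val total  = sorted.count()
    val sample = sorted.take(5)

    // Persisting before running the actions lets the second one reuse the
    // materialized result instead of recomputing it.
    val cached = sorted.persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()  // computes and fills the cache
    cached.take(5)  // served from the cached partitions

    println(s"total=$total sample=${sample.mkString(",")}")
    spark.stop()
  }
}
{code}

Whether the second run in this job really comes from the count at HoodieSparkSqlWriter.scala:433 re-evaluating the sorted data is for the Hudi maintainers to confirm; the sketch only shows the generic Spark mechanism behind a repeated stage.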
[jira] [Commented] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326525#comment-17326525 ] Sugamber commented on HUDI-1668: I've attached both screenshots. !Screenshot 2021-04-21 at 6.40.19 PM.png! !Screenshot 2021-04-21 at 6.40.40 PM.png!
[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sugamber updated HUDI-1668: --- Attachment: Screenshot 2021-04-21 at 6.40.19 PM.png
[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sugamber updated HUDI-1668: --- Attachment: Screenshot 2021-04-21 at 6.40.40 PM.png
[jira] [Commented] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326397#comment-17326397 ] Sugamber commented on HUDI-1668: [~shivnarayan], I don't have a Spark 2.4.3 cluster. I'll run the job and share a screenshot of the Spark UI.
[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sugamber updated HUDI-1668: --- Description: revised to ask whether the double execution is expected behaviour; the rest of the text duplicates the issue description in the creation entry at the bottom (with the screenshot references [^1st.png] and [^2nd.png]) and is omitted here.
[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sugamber updated HUDI-1668: --- Description: minor revision to the inline screenshot references; the full text duplicates the issue description in the creation entry at the bottom and is omitted here.
[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sugamber updated HUDI-1668: --- Priority: Minor (was: Major)
[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sugamber updated HUDI-1668: --- Description: added the [^1st.png] screenshot reference; full text omitted (see the creation entry at the bottom).
[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sugamber updated HUDI-1668: --- Description: added the [^2nd.png] screenshot reference; full text omitted (see the creation entry at the bottom).
[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sugamber updated HUDI-1668: --- Description: minor formatting change to the screenshot reference; full text omitted (see the creation entry at the bottom).
[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sugamber updated HUDI-1668: --- Attachment: 2nd.png, 1st.png
[jira] [Created] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.
Sugamber created HUDI-1668:
--
Summary: GlobalSortPartitioner is getting called twice during bulk_insert.
Key: HUDI-1668
URL: https://issues.apache.org/jira/browse/HUDI-1668
Project: Apache Hudi
Issue Type: Bug
Reporter: Sugamber
Attachments: 1st.png, 2nd.png

Hi Team,
I'm using the bulk insert option to load close to 2 TB of data. The process takes nearly 2 hours to complete. While looking at the job log, I found that [sortBy at GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1] is running twice.

It is triggered the first time at the initial stage (refer to the attached screenshot [^1st.png]). The second time it is triggered from the *[count at HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]* step. In both cases, the same number of jobs were triggered and the running times are close to each other (see [^2nd.png]).

Is there any way to run the sort only once so that the data can be loaded faster?

*Spark and Hudi configurations*
{code}
Spark - 2.3.0
Scala - 2.11.12
Hudi - 0.7.0
{code}

Hudi configuration:
{code}
"hoodie.cleaner.commits.retained" = 2
"hoodie.bulkinsert.shuffle.parallelism" = 2000
"hoodie.parquet.small.file.limit" = 1
"hoodie.parquet.max.file.size" = 12800
"hoodie.index.bloom.num_entries" = 180
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"
"hoodie.bloom.index.filter.dynamic.max.entries" = 250
"hoodie.bloom.index.bucketized.checking" = "false"
"hoodie.datasource.write.operation" = "bulk_insert"
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}

Spark configuration:
{code}
--num-executors 180
--executor-cores 4
--executor-memory 16g
--driver-memory=24g
--conf spark.rdd.compress=true
--queue=default
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600
--conf spark.driver.memoryOverhead=1200
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m
{code}
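For reference, the sketch below shows how the reported options might be wired into a bulk_insert write through the Spark DataSource API. Only the option keys and values listed above come from this report; the table name, record key and partition path fields, input source, and target path are hypothetical placeholders.

{code:scala}
// Illustrative sketch only: wiring the reported Hudi options into a Spark
// DataSource write. Table name, key/partition fields, and paths are
// hypothetical placeholders, not taken from the actual job.
import org.apache.spark.sql.{SaveMode, SparkSession}

object BulkInsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-bulk-insert-sketch")
      .getOrCreate()

    val inputDf = spark.read.parquet("/path/to/source")  // placeholder source

    inputDf.write
      .format("org.apache.hudi")
      // Options taken from the issue description:
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
      .option("hoodie.bulkinsert.shuffle.parallelism", "2000")
      .option("hoodie.cleaner.commits.retained", "2")
      // Placeholder identifiers typically required for a Hudi write:
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .mode(SaveMode.Append)
      .save("/path/to/hudi/table")  // placeholder base path

    spark.stop()
  }
}
{code}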