[jira] [Commented] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-05-25 Thread Sugamber (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351095#comment-17351095
 ] 

Sugamber commented on HUDI-1668:


[~nishith29] Yes, we can close this.

Thank you!!!

> GlobalSortPartitioner is getting called twice during bulk_insert.
> -
>
> Key: HUDI-1668
> URL: https://issues.apache.org/jira/browse/HUDI-1668
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sugamber
>Assignee: Nishith Agarwal
>Priority: Minor
>  Labels: sev:high, user-support-issues
> Attachments: 1st.png, 2nd.png, Screen Shot 2021-04-17 at 11.23.17 
> AM.png, Screenshot 2021-04-21 at 6.40.19 PM.png, Screenshot 2021-04-21 at 
> 6.40.40 PM.png
>
>
> Hi Team,
> I'm using the bulk insert option to load close to 2 TB of data. The process 
> takes nearly 2 hours to complete. While looking at the job log, I found that 
> [sortBy at 
> GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
>  runs twice.
> The first time it is triggered in one stage. *Refer to this screenshot -> [^1st.png]*.
> The second time it is triggered from the *[count at 
> HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
>  step.
> In both cases, the same number of jobs was triggered and the running times 
> were close to each other. *Refer to this screenshot* -> [^2nd.png]
> Is there any way to run the sort only once so that the data loads faster, or 
> is this expected behaviour?
> *Spark and Hudi configurations*
>  
> {code:java}
> Spark - 2.3.0
> Scala- 2.11.12
> Hudi - 0.7.0
>  
> {code}
>  
> Hudi Configuration
> {code:java}
> "hoodie.cleaner.commits.retained" = 2  
> "hoodie.bulkinsert.shuffle.parallelism"=2000  
> "hoodie.parquet.small.file.limit" = 1  
> "hoodie.parquet.max.file.size" = 12800  
> "hoodie.index.bloom.num_entries" = 180  
> "hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
> "hoodie.bloom.index.filter.dynamic.max.entries" = 250  
> "hoodie.bloom.index.bucketized.checking" = "false"  
> "hoodie.datasource.write.operation" = "bulk_insert"  
> "hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
> {code}
>  
> Spark Configuration -
> {code:java}
> --num-executors 180 
> --executor-cores 4 
> --executor-memory 16g 
> --driver-memory=24g 
> --conf spark.rdd.compress=true 
> --queue=default 
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
> --conf spark.executor.memoryOverhead=1600 
> --conf spark.driver.memoryOverhead=1200 
> --conf spark.driver.maxResultSize=2g
> --conf spark.kryoserializer.buffer.max=512m 
> {code}
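
As general Spark background, rather than a statement about Hudi's internal code path: when the RDD produced by a sortBy is not persisted, a later action on it (such as a count) re-evaluates that lineage, so the sort work can show up again as a separate, near-identical job in the Spark UI. The sketch below only illustrates persisting a sorted RDD before counting it; the object name, toy data, and storage level are illustrative assumptions, not taken from Hudi.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SortRecomputeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-recompute-sketch")
      .master("local[*]") // illustrative; the real job runs on YARN
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy key/value records standing in for the records being bulk-inserted.
    val records = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))

    // Persist the sorted RDD so the sort is materialized once; later actions
    // (for example a count) then read the cached partitions instead of
    // re-evaluating the sortBy lineage.
    val sorted = records.sortBy(_._1).persist(StorageLevel.MEMORY_AND_DISK_SER)

    println(sorted.count()) // first action: runs the sort and fills the cache
    println(sorted.count()) // second action: served from the persisted partitions

    spark.stop()
  }
}
{code}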



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-04-21 Thread Sugamber (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326525#comment-17326525
 ] 

Sugamber edited comment on HUDI-1668 at 4/21/21, 1:14 PM:
--

I've attached both screenshots.

!Screenshot 2021-04-21 at 6.40.19 PM.png!

 

!Screenshot 2021-04-21 at 6.40.40 PM.png!  


was (Author: sugamberku):
I've attached both screenshots.

!Screenshot 2021-04-21 at 6.40.19 PM.png!

 

!Screenshot 2021-04-21 at 6.40.40 PM.png!  




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-04-21 Thread Sugamber (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326526#comment-17326526
 ] 

Sugamber commented on HUDI-1668:


[~shivnarayan] I see the global sort being executed twice in this example.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-04-21 Thread Sugamber (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326525#comment-17326525
 ] 

Sugamber commented on HUDI-1668:


I've attached both screenshots.

!Screenshot 2021-04-21 at 6.40.19 PM.png!

 

!Screenshot 2021-04-21 at 6.40.40 PM.png!  




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-04-21 Thread Sugamber (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sugamber updated HUDI-1668:
---
Attachment: Screenshot 2021-04-21 at 6.40.19 PM.png




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-04-21 Thread Sugamber (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sugamber updated HUDI-1668:
---
Attachment: Screenshot 2021-04-21 at 6.40.40 PM.png




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-04-21 Thread Sugamber (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326397#comment-17326397
 ] 

Sugamber commented on HUDI-1668:


[~shivnarayan], I don't have a Spark 2.4.3 cluster.
I'll run the job and share a screenshot of the Spark UI.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-03-05 Thread Sugamber (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sugamber updated HUDI-1668:
---
Description: 
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process 
takes nearly 2 hours to complete. While looking at the job log, I found that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 runs twice.

The first time it is triggered in one stage. *Refer to this screenshot -> [^1st.png]*.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs was triggered and the running times 
were close to each other. *Refer to this screenshot* -> [^2nd.png]

Is there any way to run the sort only once so that the data loads faster, or 
is this expected behaviour?

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
 
{code}
 

Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 1  
"hoodie.parquet.max.file.size" = 12800  
"hoodie.index.bloom.num_entries" = 180  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 250  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m 



{code}

  was:
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process 
takes nearly 2 hours to complete. While looking at the job log, I found that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 runs twice.

The first time it is triggered in one stage. *Refer to this screenshot -> [^1st.png]*.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs was triggered and the running times 
were close to each other. *Refer to this screenshot* -> [^2nd.png]

Is there any way to run the sort only once so that the data loads faster?

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
 
{code}
 

Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 1  
"hoodie.parquet.max.file.size" = 12800  
"hoodie.index.bloom.num_entries" = 180  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 250  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m 



{code}


[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-03-05 Thread Sugamber (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sugamber updated HUDI-1668:
---
Description: 
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process 
takes nearly 2 hours to complete. While looking at the job log, I found that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 runs twice.

The first time it is triggered in one stage. *Refer to this screenshot -> [^1st.png]*.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs was triggered and the running times 
were close to each other. *Refer to this screenshot* -> [^2nd.png]

Is there any way to run the sort only once so that the data loads faster?

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
 
{code}
 

Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 1  
"hoodie.parquet.max.file.size" = 12800  
"hoodie.index.bloom.num_entries" = 180  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 250  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m 



{code}

  was:
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process 
takes nearly 2 hours to complete. While looking at the job log, I found that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 runs twice.

The first time it is triggered in one stage. *Refer to this screenshot -> [^1st.png]*.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs was triggered and the running times 
were close to each other. *Refer to this screenshot* -> [^2nd.png]

Is there any way to run the sort only once so that the data loads faster?

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
 
{code}
 

Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 1  
"hoodie.parquet.max.file.size" = 12800  
"hoodie.index.bloom.num_entries" = 180  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 250  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m 



{code}



[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-03-05 Thread Sugamber (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sugamber updated HUDI-1668:
---
Priority: Minor  (was: Major)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-03-05 Thread Sugamber (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sugamber updated HUDI-1668:
---
Description: 
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process 
takes nearly 2 hours to complete. While looking at the job log, I found that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 runs twice.

The first time it is triggered in one stage. *Refer to this screenshot -> [^1st.png]*.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs was triggered and the running times 
were close to each other. *Refer to this screenshot* -> 

Is there any way to run the sort only once so that the data loads faster?

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
 
{code}
 

Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 1  
"hoodie.parquet.max.file.size" = 12800  
"hoodie.index.bloom.num_entries" = 180  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 250  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m 



{code}

  was:
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process 
takes nearly 2 hours to complete. While looking at the job log, I found that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 runs twice.

The first time it is triggered in one stage. *Refer to this screenshot ->*.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs was triggered and the running times 
were close to each other.

Is there any way to run the sort only once so that the data loads faster?

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
 
{code}
 

Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 1  
"hoodie.parquet.max.file.size" = 12800  
"hoodie.index.bloom.num_entries" = 180  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 250  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m 



{code}



[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-03-05 Thread Sugamber (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sugamber updated HUDI-1668:
---
Description: 
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process 
takes nearly 2 hours to complete. While looking at the job log, I found that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 runs twice.

The first time it is triggered in one stage. *Refer to this screenshot -> [^1st.png]*.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs was triggered and the running times 
were close to each other. *Refer to this screenshot* -> [^2nd.png]

Is there any way to run the sort only once so that the data loads faster?

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
 
{code}
 

Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 1  
"hoodie.parquet.max.file.size" = 12800  
"hoodie.index.bloom.num_entries" = 180  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 250  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m 



{code}

  was:
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process 
takes nearly 2 hours to complete. While looking at the job log, I found that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 runs twice.

The first time it is triggered in one stage. *Refer to this screenshot -> [^1st.png]*.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs was triggered and the running times 
were close to each other. *Refer to this screenshot* -> 

Is there any way to run the sort only once so that the data loads faster?

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
 
{code}
 

Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 1  
"hoodie.parquet.max.file.size" = 12800  
"hoodie.index.bloom.num_entries" = 180  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 250  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m 



{code}



[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-03-05 Thread Sugamber (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sugamber updated HUDI-1668:
---
Description: 
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process 
takes nearly 2 hours to complete. While looking at the job log, I found that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 runs twice.

The first time it is triggered in one stage. *Refer to this screenshot ->*.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs was triggered and the running times 
were close to each other.

Is there any way to run the sort only once so that the data loads faster?

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
 
{code}
 

Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 1  
"hoodie.parquet.max.file.size" = 12800  
"hoodie.index.bloom.num_entries" = 180  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 250  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m 



{code}

  was:
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process 
takes nearly 2 hours to complete. While looking at the job log, I found that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 runs twice.

The first time it is triggered in one stage; refer to this screenshot.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs was triggered and the running times 
were close to each other.

Is there any way to run the sort only once so that the data loads faster?

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
 
{code}
 

Hudi Configuration
{code:java}

"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 1  
"hoodie.parquet.max.file.size" = 12800  
"hoodie.index.bloom.num_entries" = 180  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 250  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m 



{code}



[jira] [Updated] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-03-05 Thread Sugamber (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sugamber updated HUDI-1668:
---
Attachment: 2nd.png
1st.png




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1668) GlobalSortPartitioner is getting called twice during bulk_insert.

2021-03-05 Thread Sugamber (Jira)
Sugamber created HUDI-1668:
--

 Summary: GlobalSortPartitioner is getting called twice during 
bulk_insert.
 Key: HUDI-1668
 URL: https://issues.apache.org/jira/browse/HUDI-1668
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Sugamber
 Attachments: 1st.png, 2nd.png

Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process 
takes nearly 2 hours to complete. While looking at the job log, I found that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 runs twice.

The first time it is triggered in one stage; refer to this screenshot.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs was triggered and the running times 
were close to each other.

Is there any way to run the sort only once so that the data loads faster?

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala- 2.11.12
Hudi - 0.7.0
 
{code}
 

Hudi Configuration
{code:java}

"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 1  
"hoodie.parquet.max.file.size" = 12800  
"hoodie.index.bloom.num_entries" = 180  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 250  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m 



{code}
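
For reference, the Hudi options listed above are normally passed through the Spark DataFrame writer. The sketch below only restates this ticket's configuration in that form; the table name, record key, precombine and partition path fields, source DataFrame, and target path are hypothetical placeholders rather than values from this ticket.

{code:scala}
import org.apache.spark.sql.{DataFrame, SaveMode}

object BulkInsertSketch {
  // Hypothetical helper: supplies the bulk_insert options quoted in this ticket
  // through the DataFrame writer. Field names and the target path are placeholders.
  def bulkInsert(df: DataFrame, basePath: String): Unit = {
    df.write
      .format("hudi")
      .option("hoodie.table.name", "my_table")                     // placeholder
      .option("hoodie.datasource.write.recordkey.field", "id")     // placeholder
      .option("hoodie.datasource.write.precombine.field", "ts")    // placeholder
      .option("hoodie.datasource.write.partitionpath.field", "dt") // placeholder
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
      .option("hoodie.bulkinsert.shuffle.parallelism", "2000")
      .option("hoodie.cleaner.commits.retained", "2")
      .option("hoodie.parquet.small.file.limit", "1")
      .option("hoodie.parquet.max.file.size", "12800")
      .option("hoodie.index.bloom.num_entries", "180")
      .option("hoodie.bloom.index.filter.type", "DYNAMIC_V0")
      .option("hoodie.bloom.index.filter.dynamic.max.entries", "250")
      .option("hoodie.bloom.index.bucketized.checking", "false")
      .mode(SaveMode.Append)
      .save(basePath)
  }
}
{code}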



--
This message was sent by Atlassian Jira
(v8.3.4#803005)