[ https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sugamber updated HUDI-1668:
---------------------------
    Description: 
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process takes nearly 2 hours to complete. While looking at the job log, I found that [sortBy at GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1] is running twice.

The first time, it is triggered as a single stage. *Refer to this screenshot -> [^1st.png]*.

The second time, it is triggered from the *[count at HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]* step.

In both cases, the same number of jobs is triggered and the running times are close to each other. *Refer to this screenshot* -> [^2nd.png]

Is there any way to run the sort only once so that the data can be loaded faster?
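For context, a plausible explanation (a sketch of Spark's general semantics in Python, not Hudi's actual code): a transformation such as sortBy is lazy, and every action on the resulting dataset replays its lineage unless the dataset is persisted. So if the writer both writes the sorted data and then calls count() on the same uncached dataset, the sort executes twice. A minimal toy illustration:

```python
# Hypothetical sketch (not Hudi's code): mimics Spark's lazy evaluation
# to show why an uncached sort can execute once per action.

class LazyDataset:
    """Toy stand-in for an RDD: holds a recipe, not data."""

    def __init__(self, compute):
        self._compute = compute   # zero-arg function that (re)builds the data
        self._cache = None        # filled only after persist()

    def sort_by(self, key, stats):
        # Transformation: returns a new lazy dataset; nothing runs yet.
        def run():
            stats["sorts"] += 1   # count each real execution of the sort
            return sorted(self._compute(), key=key)
        return LazyDataset(run)

    def _materialize(self):
        # Action path: replays the lineage unless a cached copy exists.
        if self._cache is not None:
            return self._cache
        return self._compute()

    def save(self):               # action 1 (stands in for the table write)
        return list(self._materialize())

    def count(self):              # action 2 (like the count at scala:433)
        return len(self._materialize())

    def persist(self):
        # Materialize once and keep the result for later actions.
        self._cache = self._compute()
        return self


stats = {"sorts": 0}
raw = LazyDataset(lambda: [3, 1, 2])
sorted_ds = raw.sort_by(key=lambda x: x, stats=stats)

sorted_ds.save()                  # sort runs here
sorted_ds.count()                 # ...and again here: two identical jobs
print(stats["sorts"])             # 2

stats["sorts"] = 0
cached = raw.sort_by(key=lambda x: x, stats=stats).persist()
cached.save()
cached.count()                    # served from the cache
print(stats["sorts"])             # 1
```

If that is indeed the cause here, persisting the sorted dataset (or avoiding the extra count) would remove the duplicated sort; whether that is safe to do inside Hudi is for the maintainers to judge.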

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala - 2.11.12
Hudi - 0.7.0
{code}
 

Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 100000000  
"hoodie.parquet.max.file.size" = 128000000  
"hoodie.index.bloom.num_entries" = 1800000  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 2500000  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
 

Spark Configuration -
{code:java}
--num-executors 180
--executor-cores 4
--executor-memory 16g
--driver-memory 24g
--conf spark.rdd.compress=true
--queue default
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600
--conf spark.driver.memoryOverhead=1200
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m
{code}


> GlobalSortPartitioner is getting called twice during bulk_insert.
> -----------------------------------------------------------------
>
>                 Key: HUDI-1668
>                 URL: https://issues.apache.org/jira/browse/HUDI-1668
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Sugamber
>            Priority: Minor
>         Attachments: 1st.png, 2nd.png
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
