[ 
https://issues.apache.org/jira/browse/HUDI-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sugamber updated HUDI-1668:
---------------------------
    Description: 
Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process is 
taking nearly 2 hours to complete. While looking at the job log, I noticed that 
[sortBy at 
GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1]
 is running twice.

The first time it is triggered as its own stage. *Refer to the attached screenshot ->*.

The second time it is triggered from the *[count at 
HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]*
 step.

In both cases, the same number of jobs is triggered and the running times are 
close to each other.

Is there any way to run the sort only once so that the data can be loaded faster?
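
For context, here is a minimal, generic Spark sketch (not Hudi's actual code path) of why the same sort job can show up twice: when a shuffle-heavy transformation such as {{sortBy}} is followed by two separate actions on an RDD that was never persisted, Spark re-runs the whole lineage, including the sort, for each action. The RDD, values, and output path below are purely illustrative.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object DoubleSortSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("double-sort-illustration")
      .master("local[*]")                      // illustrative only
      .getOrCreate()
    val sc = spark.sparkContext

    // Expensive, shuffle-producing transformation (stand-in for the global
    // sort done via GlobalSortPartitioner during bulk_insert).
    val sorted = sc.parallelize(1 to 1000000).map(x => (x % 1000, x)).sortBy(_._1)

    // Action 1: write the data out (stand-in for writing the data files).
    sorted.saveAsTextFile("/tmp/sorted-output") // path is illustrative

    // Action 2: count the results (stand-in for the `count at
    // HoodieSparkSqlWriter.scala:433` step). Because `sorted` was never
    // persisted, Spark re-runs the full lineage -- including the sort --
    // to serve this second action.
    println(sorted.count())

    // Persisting before the first action would let the count reuse the result:
    // sorted.persist(StorageLevel.MEMORY_AND_DISK_SER)

    spark.stop()
  }
}
{code}
Whether Hudi persists the intermediate result between the write and the count step is exactly what this ticket is asking about; the sketch only shows the general Spark behaviour that produces two near-identical sort jobs.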

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala - 2.11.12
Hudi - 0.7.0
{code}
 

Hudi Configuration
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 100000000  
"hoodie.parquet.max.file.size" = 128000000  
"hoodie.index.bloom.num_entries" = 1800000  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 2500000  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
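
As an illustration only (not the actual job code), here is a minimal sketch of how options like the ones above can be passed through the Hudi datasource writer; the DataFrame {{df}}, the {{basePath}}, and the table/record-key/precombine/partition field names are placeholder assumptions.
{code:scala}
import org.apache.spark.sql.{DataFrame, SaveMode}

object HudiBulkInsertSketch {
  // `df`, `basePath`, and the table/field names below are placeholders.
  def bulkInsert(df: DataFrame, basePath: String): Unit = {
    df.write
      .format("hudi")
      .option("hoodie.table.name", "my_table")                          // placeholder
      .option("hoodie.datasource.write.recordkey.field", "record_key")  // placeholder
      .option("hoodie.datasource.write.precombine.field", "ts")         // placeholder
      .option("hoodie.datasource.write.partitionpath.field", "dt")      // placeholder
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
      .option("hoodie.bulkinsert.shuffle.parallelism", "2000")
      .option("hoodie.cleaner.commits.retained", "2")
      .option("hoodie.parquet.small.file.limit", "100000000")
      .option("hoodie.parquet.max.file.size", "128000000")
      .mode(SaveMode.Append)
      .save(basePath)
  }
}
{code}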
 

Spark Configuration -
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m
{code}
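
For completeness, a rough sketch of the same executor-side settings expressed programmatically through {{SparkSession.builder}}; the app name is a placeholder, and the driver memory settings are deliberately omitted because they only take effect when given at submit time (the driver JVM is already running when this code executes).
{code:scala}
import org.apache.spark.sql.SparkSession

object SparkConfSketch {
  def main(args: Array[String]): Unit = {
    // Programmatic equivalents of the submit-time flags listed above.
    val spark = SparkSession.builder()
      .appName("hudi-bulk-insert")                  // placeholder app name
      .config("spark.executor.instances", "180")    // --num-executors 180
      .config("spark.executor.cores", "4")          // --executor-cores 4
      .config("spark.executor.memory", "16g")       // --executor-memory 16g
      .config("spark.executor.memoryOverhead", "1600")
      .config("spark.yarn.queue", "default")        // --queue default
      .config("spark.rdd.compress", "true")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.kryoserializer.buffer.max", "512m")
      .config("spark.driver.maxResultSize", "2g")
      .getOrCreate()

    // ... bulk_insert job would run here ...

    spark.stop()
  }
}
{code}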



> GlobalSortPartitioner is getting called twice during bulk_insert.
> -----------------------------------------------------------------
>
>                 Key: HUDI-1668
>                 URL: https://issues.apache.org/jira/browse/HUDI-1668
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Sugamber
>            Priority: Major
>         Attachments: 1st.png, 2nd.png
>


