Sugamber created HUDI-1668:
------------------------------

             Summary: GlobalSortPartitioner is getting called twice during 
bulk_insert.
                 Key: HUDI-1668
                 URL: https://issues.apache.org/jira/browse/HUDI-1668
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Sugamber
         Attachments: 1st.png, 2nd.png

Hi Team,

I'm using the bulk insert option to load close to 2 TB of data. The process is taking nearly 2 hours to complete. While looking at the job log, I noticed that [sortBy at GlobalSortPartitioner.java:41|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=1] is running twice.

The first time, it is triggered at stage 1; refer to the first screenshot (1st.png).

The second time, it is triggered from the *[count at HoodieSparkSqlWriter.scala:433|https://gdlcuspc1a3-6.us-central1.us.walmart.net:18481/history/application_1614298633248_1444/1/jobs/job?id=2]* step; refer to the second screenshot (2nd.png).

In both cases, the same number of jobs is triggered and their running times are close to each other.



Is there any way to run the sort only once so that the data can be loaded faster?
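
For illustration only (this is not Hudi source code, and all names and values below are placeholders): a minimal Spark sketch of how an RDD produced by {{sortBy}} gets recomputed when two separate actions run on it without caching, which matches the double execution described above.
{code:scala}
// Minimal sketch, not Hudi source: an RDD built with sortBy is evaluated
// lazily, so each action that touches it re-runs the sort unless the
// result is persisted first. All names and values here are placeholders.
import org.apache.spark.sql.SparkSession

object SortRecomputeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-recompute-sketch")
      .master("local[*]")
      .getOrCreate()

    val records = spark.sparkContext.parallelize(Seq("b", "a", "c"))

    // Stand-in for the sortBy at GlobalSortPartitioner.java:41
    val sorted = records.sortBy(identity, ascending = true, numPartitions = 4)

    sorted.collect() // first action: the sort runs
    sorted.count()   // second action: the sort runs again

    spark.stop()
  }
}
{code}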

*Spark and Hudi configurations*

 
{code:java}
Spark - 2.3.0
Scala - 2.11.12
Hudi - 0.7.0
{code}

*Hudi configuration*
{code:java}
"hoodie.cleaner.commits.retained" = 2  
"hoodie.bulkinsert.shuffle.parallelism"=2000  
"hoodie.parquet.small.file.limit" = 100000000  
"hoodie.parquet.max.file.size" = 128000000  
"hoodie.index.bloom.num_entries" = 1800000  
"hoodie.bloom.index.filter.type" = "DYNAMIC_V0"  
"hoodie.bloom.index.filter.dynamic.max.entries" = 2500000  
"hoodie.bloom.index.bucketized.checking" = "false"  
"hoodie.datasource.write.operation" = "bulk_insert"  
"hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
{code}
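
For context, a minimal sketch (not the actual job) of how these options are passed on the bulk_insert write; the table name, key/partition fields and base path below are placeholders:
{code:scala}
// Sketch only: how the Hudi options above are supplied on a bulk_insert
// write. Table name, record key field, partition path field and base path
// are placeholders, not the real job's values.
import org.apache.spark.sql.{DataFrame, SaveMode}

def bulkInsert(df: DataFrame): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "my_table")                     // placeholder
    .option("hoodie.datasource.write.recordkey.field", "id")     // placeholder
    .option("hoodie.datasource.write.partitionpath.field", "dt") // placeholder
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
    .option("hoodie.bulkinsert.shuffle.parallelism", "2000")
    .option("hoodie.parquet.small.file.limit", "100000000")
    .option("hoodie.parquet.max.file.size", "128000000")
    .option("hoodie.cleaner.commits.retained", "2")
    .option("hoodie.index.bloom.num_entries", "1800000")
    .option("hoodie.bloom.index.filter.type", "DYNAMIC_V0")
    .option("hoodie.bloom.index.filter.dynamic.max.entries", "2500000")
    .option("hoodie.bloom.index.bucketized.checking", "false")
    .mode(SaveMode.Append)
    .save("/path/to/base")                                       // placeholder
}
{code}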
 

*Spark configuration*
{code:java}
--num-executors 180 
--executor-cores 4 
--executor-memory 16g 
--driver-memory=24g 
--conf spark.rdd.compress=true 
--queue=default 
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.executor.memoryOverhead=1600 
--conf spark.driver.memoryOverhead=1200 
--conf spark.driver.maxResultSize=2g
--conf spark.kryoserializer.buffer.max=512m
{code}
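
For reference, a sketch of the same settings expressed programmatically (an assumption on my part, not how the job is actually submitted); driver memory and overhead generally still have to be set at launch time:
{code:scala}
// Sketch: programmatic equivalents of the spark-submit flags above.
// The application name is a placeholder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-bulk-insert")                                              // placeholder
  .config("spark.executor.instances", "180")                                // --num-executors
  .config("spark.executor.cores", "4")                                      // --executor-cores
  .config("spark.executor.memory", "16g")                                   // --executor-memory
  .config("spark.driver.memory", "24g")                                     // effective only at launch
  .config("spark.rdd.compress", "true")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.executor.memoryOverhead", "1600")
  .config("spark.driver.memoryOverhead", "1200")                            // effective only at launch
  .config("spark.driver.maxResultSize", "2g")
  .config("spark.kryoserializer.buffer.max", "512m")
  .getOrCreate()
{code}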


