pranotishanbhag opened a new issue #2414:
URL: https://github.com/apache/hudi/issues/2414


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? Yes. Also looked 
at the Tuning guide: 
https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org: I have subscribed to it. I also posted this on the 
Slack group, and the suggestion there was to create a GitHub issue.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly: Not sure; 
it may be a config issue at my end.
   
   **Describe the problem you faced**
   
   After a 4 TB bulk_insert, the resulting files are much smaller than the 120 MB default 
target size, and a subsequent 300 GB upsert spends most of its runtime in the stage 
"Getting small files from partitions".
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Submit a Spark job on EMR 5.30 or above.
   2. Bulk_insert a large dataset (around 4 TB) with the PARTITION_SORT or GLOBAL_SORT 
sort mode (see the sketch below).
   3. Upsert a delta of around 300 GB.
   4. Observe that the upsert job spends too much time in the stage "Getting small files 
from partitions (count at HoodieSparkSqlWriter.scala:389)".
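   
   For reference, a minimal sketch of the two writes above (the table name, record key, 
precombine field, and paths are placeholders; only the uid and source partition columns 
come from my setup):
   
   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   val spark = SparkSession.builder().appName("hudi-bulk-insert-then-upsert").getOrCreate()
   val basePath = "s3://my-bucket/hudi/my_table"   // placeholder path
   
   val commonOpts = Map(
     "hoodie.table.name"                           -> "my_table",    // placeholder
     "hoodie.datasource.write.recordkey.field"     -> "id",          // placeholder key column
     "hoodie.datasource.write.partitionpath.field" -> "uid,source",  // partition columns from this issue
     "hoodie.datasource.write.precombine.field"    -> "ts",          // placeholder ordering column
     // Two partition columns generally need the complex key generator.
     "hoodie.datasource.write.keygenerator.class"  -> "org.apache.hudi.keygen.ComplexKeyGenerator"
   )
   
   // Step 2: bulk_insert the ~4 TB dataset with an explicit sort mode.
   spark.read.parquet("s3://my-bucket/raw/full")   // placeholder source
     .write.format("hudi")
     .options(commonOpts)
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.bulkinsert.shuffle.parallelism", "15000")
     .option("hoodie.bulkinsert.sort.mode", "PARTITION_SORT")   // or GLOBAL_SORT
     .mode(SaveMode.Overwrite)
     .save(basePath)
   
   // Step 3: upsert the ~300 GB delta into the same table.
   spark.read.parquet("s3://my-bucket/raw/delta")  // placeholder source
     .write.format("hudi")
     .options(commonOpts)
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.upsert.shuffle.parallelism", "600")
     .mode(SaveMode.Append)
     .save(basePath)
   ```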
   
   **Expected behavior**
   
   I expect bulk_insert to produce well-sized files of around 128 MB with few small files, 
and subsequent upserts to be faster, with little or no time spent on small-file handling.
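   
   For context, these are the file-sizing knobs I understand to be involved (a sketch; the 
128 MB value is my target, while the shipped defaults are a 120 MB max file size and a 
100 MB small-file limit):
   
   ```scala
   // File-sizing options relevant to the expectation above (a sketch, not a recommendation).
   val sizingOpts = Map(
     // Upper bound for a base parquet file; the default is 120 MB.
     "hoodie.parquet.max.file.size"    -> (128L * 1024 * 1024).toString,
     // Files below this size are considered "small" and are bin-packed on later upserts;
     // the default is 100 MB.
     "hoodie.parquet.small.file.limit" -> (100L * 1024 * 1024).toString
   )
   ```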
   
   **Environment Description**
   
   * Hudi version : 0.6
   
   * Spark version : 2.4
   
   * Hive version : 2.5
   
   * Hadoop version : 2.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   I have a 4 TB dataset that I wrote with the "bulk_insert" operation and 
"hoodie.bulkinsert.shuffle.parallelism" set to 15000. The job took about 2 hours, and the 
data is stored in S3 partitioned on two columns (uid and source). A few questions:
   1) What is the right value for hoodie.bulkinsert.shuffle.parallelism? Going by the rule 
of thumb of 500 MB per partition, the value would be 5000, but with that setting my job 
was very slow. I then changed it to 15000 and it completed in 2 hours.
   I used PARTITION_SORT instead of the default GLOBAL_SORT for hoodie.bulkinsert.sort.mode 
because I wanted the data sorted within each partition. The job was a few minutes faster 
than with GLOBAL_SORT, but I saw more variation in file sizes: with PARTITION_SORT, some 
files in a few partitions were very small (1-2 MB) and others were around 60-70 MB, whereas 
with GLOBAL_SORT most files were 60-70 MB. The default target file size is supposed to be 
120 MB. Why are we seeing smaller files, and which sort mode is better for reaching the 
120 MB target?
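   To make sure I am applying the tuning guide's rule of thumb correctly, this is the 
arithmetic I am assuming (a sketch; the result depends on how the input size is measured, 
compressed on disk versus deserialized in memory):
   
   ```scala
   // Rule of thumb from the tuning guide: roughly one shuffle partition per ~500 MB of input.
   def suggestedParallelism(inputSizeBytes: Long,
                            bytesPerPartition: Long = 500L * 1024 * 1024): Long =
     math.max(1L, (inputSizeBytes + bytesPerPartition - 1) / bytesPerPartition)
   
   // Usage with a placeholder input size.
   val bulkInsertParallelism = suggestedParallelism(inputSizeBytes = 4L * 1024 * 1024 * 1024 * 1024)
   
   // hoodie.bulkinsert.sort.mode accepts NONE, GLOBAL_SORT (the default) or PARTITION_SORT.
   val sortMode = "PARTITION_SORT"
   ```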
   2) After the bulk_insert, I tried an upsert with a 300 GB delta. For this I set 
hoodie.upsert.shuffle.parallelism to 600, following the tuning guide's 500 MB-per-partition 
guidance. The job completed in 1 hour 20 minutes, of which about 50 minutes were spent in 
the stage "Getting small files from partitions (count at HoodieSparkSqlWriter.scala:389)". 
I assume this is because of small files, but I am not sure what I have done wrong here. 
Could it be related to the smaller files created by the previous bulk_insert, or is there 
a setting I am missing?
   We expect a delta of around 200-500 MB every 12 hours, and we run the job on EMR 5.30 
for now (we will upgrade after the initial Hudi experiments). I am using Hudi version 0.6.
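   For completeness, these are the upsert-side settings I understand to interact with 
small-file handling (a sketch; the record-size estimate below is a placeholder, not a 
measured value):
   
   ```scala
   // Upsert-side options that affect small-file bin-packing (sketch with placeholder values).
   val upsertOpts = Map(
     "hoodie.upsert.shuffle.parallelism"       -> "600",
     // Average record size in bytes, used to estimate how many incoming records
     // fit into an existing small file.
     "hoodie.copyonwrite.record.size.estimate" -> "512",
     // Files below this threshold count as "small" and are targeted for expansion;
     // the default is 100 MB.
     "hoodie.parquet.small.file.limit"         -> (100L * 1024 * 1024).toString
   )
   ```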
   
   **Stacktrace**
   
   No exception. The job spends too much time in the stage "Getting small files from 
partitions (count at HoodieSparkSqlWriter.scala:389)".
   
   

