codejoyan commented on issue #2620: URL: https://github.com/apache/hudi/issues/2620#issuecomment-791961553
Thanks @bvaradar and @nsivabalan. Please let me know how to improve the performance. I used the configurations below (SIMPLE index, with compaction turned off) to speed up ingestion and saw a big improvement:

    hoodie.parquet.small.file.limit = 0
    hoodie.index.type = SIMPLE

But what are the downsides of not using the default BLOOM index? In my use case I have late-arriving data, so will performance suffer because of this choice?

I would also like to understand why these specific steps take so long. From the Spark web UI, execution of the methods below dominates the runtime. Any insights into what is happening in the background, please?

    org.apache.hudi.index.bloom.SparkHoodieBloomIndex.findMatchingFilesForRecordKeys(SparkHoodieBloomIndex.java:266)
    org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocationBacktoRecords(SparkHoodieBloomIndex.java:287)
    org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:433)
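For context, a minimal sketch of how the two configs mentioned above would be applied on a Hudi upsert via the Spark datasource. The table name, path, and key/precombine fields are hypothetical placeholders, not from this issue; only the two option keys and values come from the comment:

```scala
// Sketch only: table name, path, and field names are hypothetical.
df.write
  .format("hudi")
  .option("hoodie.table.name", "my_table")                  // hypothetical
  .option("hoodie.datasource.write.recordkey.field", "id")  // hypothetical
  .option("hoodie.datasource.write.precombine.field", "ts") // hypothetical
  .option("hoodie.datasource.write.operation", "upsert")
  // Use the SIMPLE index: skip bloom-filter candidate pruning and instead
  // join incoming keys directly against keys read from existing files.
  .option("hoodie.index.type", "SIMPLE")
  // Disable small-file handling, so new inserts are not routed into
  // existing small files (trades file-count growth for faster writes).
  .option("hoodie.parquet.small.file.limit", "0")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("/tmp/hudi/my_table")                               // hypothetical path
```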