codejoyan commented on issue #2620: URL: https://github.com/apache/hudi/issues/2620#issuecomment-791961553
Thanks @bvaradar and @nsivabalan. Please let me know how to improve the performance. I used the configurations below (SIMPLE index, with compaction turned off) to speed up ingestion and saw a big improvement:

    hoodie.parquet.small.file.limit = 0
    hoodie.index.type = SIMPLE

But what are the downsides of not using the default BLOOM index? In my use case I have late-arriving data, so will performance suffer because of this choice?

I would also like to understand why these specific steps take so long. From the Spark web UI, execution of the methods below dominates the runtime. Any insights into what is happening in the background, please?

    org.apache.hudi.index.bloom.SparkHoodieBloomIndex.findMatchingFilesForRecordKeys(SparkHoodieBloomIndex.java:266)
    org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocationBacktoRecords(SparkHoodieBloomIndex.java:287)
    org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:433)
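For context, a minimal sketch of how the two configs mentioned above would be applied on a Hudi upsert via the Spark datasource. The table name, path, and key/precombine fields are hypothetical placeholders, not from this issue; only the two option keys and values come from the comment:

```scala
// Sketch only: table name, path, and field names are hypothetical.
df.write
  .format("hudi")
  .option("hoodie.table.name", "my_table")                  // hypothetical
  .option("hoodie.datasource.write.recordkey.field", "id")  // hypothetical
  .option("hoodie.datasource.write.precombine.field", "ts") // hypothetical
  .option("hoodie.datasource.write.operation", "upsert")
  // Use the SIMPLE index: skip bloom-filter candidate pruning and instead
  // join incoming keys directly against keys read from existing files.
  .option("hoodie.index.type", "SIMPLE")
  // Disable small-file handling, so new inserts are not routed into
  // existing small files (trades file-count growth for faster writes).
  .option("hoodie.parquet.small.file.limit", "0")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("/tmp/hudi/my_table")                               // hypothetical path
```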