wkhappy1 opened a new issue, #10979: URL: https://github.com/apache/hudi/issues/10979
We have a table of 149,865,845 rows with 1,650 columns; its size in HDFS is 27.1 GB. The table query type is COW. We overwrite the table with code like the following (the snippet is truncated as posted):

```scala
val dataFrame = otherTable
dataFrame.write.mode(SaveMode.Append)
```

We give Spark 6 executors, each with 60 GB of memory and 6 cores, and use the Hudi configuration below to overwrite the table:

```scala
spark.memory.storageFraction=0.6

"hoodie.datasource.write.operation" -> "insert_overwrite_table",
"hoodie.insert.shuffle.parallelism" -> "50",
"hoodie.upsert.shuffle.parallelism" -> "50",
RECORDKEY_FIELD_OPT_KEY -> "id",
PRECOMBINE_FIELD_OPT_KEY -> "ts",
PARTITIONPATH_FIELD_OPT_KEY -> "tenant_id",
PAYLOAD_CLASS_OPT_KEY -> classOf[OverwriteWithLatestAvroPayload].getName,
"hoodie.parquet.compression.ratio" -> "2.0",
"hoodie.parquet.max.file.size" -> "41943040"
```

We execute this code every hour, but it takes 2 hours to finish. Cache memory usage is shown below:

<img width="746" alt="1" src="https://github.com/apache/hudi/assets/54095696/60ed0497-9345-4467-8bc5-d88ee4d2a424">

Also, the stage `Getting ExistingFileIds of all partitions count at HoodieSparkSqlWriter.scala:645` is slow.

**Environment Description**

* Hudi version: 0.11.1
* Spark version: 3.2.2
* Hive version: 3.1.3
* Hadoop version: 3.3.2
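For context, the posted snippet and options can be combined into one complete write call. This is only a hedged sketch against the Hudi 0.11 Spark datasource API: `spark`, `otherTable`, `basePath`, and the table name are placeholders not given in the issue.

```scala
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Assumptions: `otherTable` is the source DataFrame and `basePath`
// is the Hudi table's HDFS path; neither appears in full in the issue.
val dataFrame = otherTable

dataFrame.write
  .format("hudi")
  .option(HoodieWriteConfig.TBL_NAME.key, "my_table")  // hypothetical table name
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "tenant_id")
  .option(PAYLOAD_CLASS_OPT_KEY, classOf[OverwriteWithLatestAvroPayload].getName)
  // Replaces the whole table's contents on each run.
  .option("hoodie.datasource.write.operation", "insert_overwrite_table")
  .option("hoodie.insert.shuffle.parallelism", "50")
  .option("hoodie.upsert.shuffle.parallelism", "50")
  .option("hoodie.parquet.compression.ratio", "2.0")
  .option("hoodie.parquet.max.file.size", "41943040")  // 40 MB target files
  .mode(SaveMode.Append)
  .save(basePath)
```

Note that with `insert_overwrite_table`, `SaveMode.Append` still replaces the existing table data; the small 40 MB max file size combined with ~27 GB of data produces many file groups per partition, which is one plausible reason the `Getting ExistingFileIds of all partitions` step is slow.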