afeldman1 opened a new issue #2639: URL: https://github.com/apache/hudi/issues/2639
Using Hudi 0.6.0, after upgrading from Spark 2.4.7 on EMR 5.32.0 to Spark 3.0.1 on EMR 6.2.0, there is a significant slowdown when writing into Hudi tables. With Spark 2.4.7 the write into the Hudi table takes about 6 minutes; with Spark 3.0.1 the same code takes about 3.4 hours. The table has 60,073,874 records. The source Spark is reading the data from is a single un-partitioned Parquet file. AWS Glue is used as the metastore and S3 as the table file store location.

Originally the job was writing into a Hudi table with a single partition, but I also attempted to break it up into partitioned groups as seen below, with no significant change in performance:

![image](https://user-images.githubusercontent.com/4326577/110195469-3db5d980-7e0b-11eb-880a-76bf96d81c13.png)

The Hudi write config is the same with both Spark 2.4.7 and Spark 3.0.1:

```scala
DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> datalakeDbName,
HoodieWriteConfig.TABLE_NAME -> table.tableName,
DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL,
DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[ComplexKeyGenerator].getName,
DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> table.keyCols.reduce(_ + "," + _),
DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> table.partitionCols,
DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "ts",
DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> table.tableName,
DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getCanonicalName,
DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> partitionCols,
DataSourceWriteOptions.HIVE_URL_OPT_KEY -> getHiveMetastoreJdbcUrl
```

(I also attempted switching `DataSourceWriteOptions.OPERATION_OPT_KEY` to `DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL`, but this does not help.) It's
spending most of its time on the "UpsertPartitioner" / "Getting small files from partitions" phase, in the Spark jobs "sortBy at GlobalSortPartitioner.java" and "count at HoodieSparkSqlWriter.scala". Is there something wrong with the config that's causing this extreme increase in Hudi processing time?

With Spark 3.0.1, using a single partition:

![image](https://user-images.githubusercontent.com/4326577/110195630-7e622280-7e0c-11eb-8439-97ce050b6a6b.png)

With Spark 3.0.1, using multiple partitions:

![image](https://user-images.githubusercontent.com/4326577/110195633-83bf6d00-7e0c-11eb-9b3c-dda61cb13716.png)

With Spark 2.4.7:

![image](https://user-images.githubusercontent.com/4326577/110195613-5a064600-7e0c-11eb-8d4b-665b3c0d3f93.png)

**Problematic Environment Description**

* Hudi version: 0.6.0
* Spark version: 3.0.1
* Storage: AWS S3
* Running on EMR

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
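For context, the write options quoted in the issue are applied through the standard Spark datasource path. A minimal sketch of that invocation (not the reporter's actual job; `inputDf`, `hudiOptions`, and `basePath` are placeholders for the real source DataFrame, the option map above, and the S3 table location):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: apply the Hudi option map from the issue to a bulk-insert write.
// hudiOptions is assumed to be the Map[String, String] built from the
// DataSourceWriteOptions entries quoted above.
def writeHudi(inputDf: DataFrame,
              hudiOptions: Map[String, String],
              basePath: String): Unit =
  inputDf.write
    .format("hudi")          // Hudi datasource, same for Spark 2.4.x and 3.0.x
    .options(hudiOptions)    // table name, key generator, hive sync, etc.
    .mode(SaveMode.Append)
    .save(basePath)          // e.g. the s3:// table location
```

Since the same option map is used on both EMR versions, the slowdown is unlikely to come from this call itself; it only shows where the "sortBy at GlobalSortPartitioner.java" and "count at HoodieSparkSqlWriter.scala" stages are triggered from.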