[ https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zenglinxi updated SPARK-14974: ------------------------------ Description: Recently, we often encounter problems using spark sql for inserting data into a partition table (ex.: insert overwrite table $output_table partition(dt) select xxx from tmp_table). After the spark job start running on yarn, the app will create too many files (ex. 2,000,000, or even 10,000,000), which will make HDFS under enormous pressure. We found that the num of files created by spark job is depending on the partition num of hive table that will be inserted and the num of spark sql partitions. files_num = hive_table_partions_num * spark_sql_partitions_num. We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= 1000, and the hive_table_partions_num is very small under normal circumstances, but it will turn out to be more than 2000 when we input a wrong field as the partion field unconsciously, which will make the files_num >= 1000 * 2000 = 2,000,000. There is a configuration parameter in hive that can limit the maximum number of dynamic partitions allowed to be created in each mapper/reducer named hive.exec.max.dynamic.partitions.pernode, but this conf parameter did't work when we use hiveContext. Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the files_num be smaller, but it will affect the concurrency. was: Recently, we often encounter problems using spark sql for inserting data into a partition table (ex.: insert overwrite table $output_table partition(dt) select xxx from tmp_table). After the spark job start running on yarn, the app will create too many files (ex. 200w+, or even 1000w+), which will make HDFS under enormous pressure. We found that the num of files created by spark job is depending on the partition num of hive table that will be inserted and the num of spark sql partitions. files_num = hive_table_partions_num * spark_sql_partitions_num. We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= 1000, and the hive_table_partions_num is very small under normal circumstances, but it will turn out to be more than 2000 when we input a wrong field as the partion field unconsciously, which will make the files_num >= 1000 * 2000 = 200w. There is a configuration parameter in hive that can limit the maximum number of dynamic partitions allowed to be created in each mapper/reducer named hive.exec.max.dynamic.partitions.pernode, but this conf parameter did't work when we use hiveContext. Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the files_num be smaller, but it will affect the concurrency. > spark sql job create too many files in HDFS when doing insert overwrite hive > table > ---------------------------------------------------------------------------------- > > Key: SPARK-14974 > URL: https://issues.apache.org/jira/browse/SPARK-14974 > Project: Spark > Issue Type: Task > Components: SQL > Affects Versions: 1.5.2 > Reporter: zenglinxi > > Recently, we often encounter problems using spark sql for inserting data into > a partition table (ex.: insert overwrite table $output_table partition(dt) > select xxx from tmp_table). > After the spark job start running on yarn, the app will create too many files > (ex. 2,000,000, or even 10,000,000), which will make HDFS under enormous > pressure. > We found that the num of files created by spark job is depending on the > partition num of hive table that will be inserted and the num of spark sql > partitions. > files_num = hive_table_partions_num * spark_sql_partitions_num. > We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= > 1000, and the hive_table_partions_num is very small under normal > circumstances, but it will turn out to be more than 2000 when we input a > wrong field as the partion field unconsciously, which will make the files_num > >= 1000 * 2000 = 2,000,000. > There is a configuration parameter in hive that can limit the maximum number > of dynamic partitions allowed to be created in each mapper/reducer named > hive.exec.max.dynamic.partitions.pernode, but this conf parameter did't work > when we use hiveContext. > Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the > files_num be smaller, but it will affect the concurrency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org