BTW, I have created a JIRA task to track this issue: https://issues.apache.org/jira/browse/SPARK-14974
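
In the meantime, one workaround that keeps the file count down (it does not add the missing limit itself) is to cluster rows by the dynamic partition column before the insert, so each reducer only writes files for the partitions it owns. A rough sketch against a Spark 1.6-style HiveContext; "output_table" and "tmp_table" are just the placeholders from your example:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("dynamic-partition-insert"))
    val hiveContext = new HiveContext(sc)

    // Allow a dynamic partition insert with no static partition columns.
    hiveContext.setConf("hive.exec.dynamic.partition", "true")
    hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    // DISTRIBUTE BY dt routes all rows with the same dt value to the same
    // reducer, so files_num drops from
    //   hive_table_partitions_num * spark_sql_partitions_num
    // to roughly hive_table_partitions_num.
    hiveContext.sql(
      """INSERT OVERWRITE TABLE output_table PARTITION (dt)
        |SELECT * FROM tmp_table
        |DISTRIBUTE BY dt
      """.stripMargin)

The trade-off is skew: if one dt value dominates, its single reducer becomes a straggler, in which case distributing by (dt, some bucketing expression) spreads the load at the cost of a few more files per partition.
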
2016-04-28 18:08 GMT+08:00 linxi zeng <linxizeng0...@gmail.com>:

> Hi,
>
> Recently, we have often run into problems when using Spark SQL to insert
> data into a partitioned table (e.g.: insert overwrite table $output_table
> partition(dt) select xxx from tmp_table).
> After the Spark job starts running on YARN, *the app creates far too many
> files (e.g. 2,000,000+, or even 10,000,000+), which puts HDFS under
> enormous pressure*.
> We found that the number of files created by the Spark job depends on the
> number of partitions of the Hive table being inserted into and on the
> number of Spark SQL partitions:
> *files_num = hive_table_partitions_num * spark_sql_partitions_num*.
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >=
> 1000, and hive_table_partitions_num is usually small under normal
> circumstances, but it can grow to more than 2000 when we accidentally use
> a wrong field as the partition field, which makes
> files_num >= 1000 * 2000 = 2,000,000.
>
> Hive has a configuration parameter,
> *hive.exec.max.dynamic.partitions.pernode*, that limits the maximum
> number of dynamic partitions each mapper/reducer is allowed to create,
> but this parameter has no effect when we use HiveContext.
>
> Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) makes
> files_num smaller, but it also reduces concurrency.
>
> Could we add configuration parameters to limit the maximum number of
> files each task is allowed to create, or to limit
> spark_sql_partitions_num without affecting concurrency?
>