Hi,

Recently, we have often run into problems when using Spark SQL to insert data
into a partitioned table (e.g.: insert overwrite table $output_table
partition(dt) select xxx from tmp_table).
After the Spark job starts running on YARN, *the app can create far too many
files (e.g. 2 million+, or even 10 million+), which puts HDFS under enormous
pressure*.
We found that the number of files created by the Spark job depends on the
number of partitions of the Hive table being inserted into and the number of
Spark SQL partitions:
*files_num = hive_table_partitions_num * spark_sql_partitions_num*.
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >=
1000, and hive_table_partitions_num is usually small under normal
circumstances, but it can turn out to be more than 2000 when we accidentally
use the wrong field as the partition field, which makes
files_num >= 1000 * 2000 = 2,000,000.
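
For reference, a minimal sketch (Scala) of the pattern that triggers this,
assuming the SELECT contains a shuffle (here a made-up aggregation) so that
spark.sql.shuffle.partitions controls the number of writing tasks; the table
and column names are only placeholders:

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext()             // conf taken from spark-submit
  val hiveContext = new HiveContext(sc)

  // 1000 shuffle partitions -> up to 1000 writing tasks.
  hiveContext.setConf("spark.sql.shuffle.partitions", "1000")
  // Allow a fully dynamic partition insert.
  hiveContext.setConf("hive.exec.dynamic.partition", "true")
  hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

  // Each of the 1000 tasks writes one file per distinct dt value it sees,
  // so ~2000 distinct dt values can yield up to 1000 * 2000 = 2,000,000 files.
  hiveContext.sql(
    """INSERT OVERWRITE TABLE output_table PARTITION (dt)
      |SELECT col_a, count(*) AS cnt, dt
      |FROM tmp_table
      |GROUP BY col_a, dt""".stripMargin)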

There is a configuration parameter in Hive, *hive.exec.max.dynamic.partitions.pernode*,
that limits the maximum number of dynamic partitions allowed to be created by
each mapper/reducer, but this parameter didn't work when we used hiveContext.
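
A sketch of how the limit can be set through hiveContext (the value 100 is
just an example); as noted above, it did not take effect for us:

  hiveContext.setConf("hive.exec.max.dynamic.partitions.pernode", "100")
  // or, equivalently, through SQL:
  hiveContext.sql("SET hive.exec.max.dynamic.partitions.pernode=100")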

Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) makes
files_num smaller, but it also reduces the concurrency of the whole job.
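
For example, lowering the setting to 200 shrinks the worst case above to
200 * 2000 = 400,000 files, but every shuffle in the job then runs with only
200 tasks:

  // Fewer writing tasks -> fewer files, but less parallelism for every shuffle.
  hiveContext.setConf("spark.sql.shuffle.partitions", "200")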

Could we add configuration parameters to limit the maximum number of files
each task is allowed to create, or to limit spark_sql_partitions_num for the
write without affecting the concurrency of the rest of the job?
