[ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-14974:
------------------------------
    Description: 
Recently, we often encounter problems when using Spark SQL to insert data into a 
partitioned Hive table (e.g.: insert overwrite table $output_table partition(dt) 
select xxx from tmp_table).
After the Spark job starts running on YARN, the application can create far too 
many files (e.g. 2 million+, or even 10 million+), which puts HDFS under enormous 
pressure.
We found that the number of files created by the Spark job depends on the number 
of partitions of the Hive table being inserted into and the number of Spark SQL 
shuffle partitions:
files_num = hive_table_partitions_num * spark_sql_partitions_num
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, and 
hive_table_partitions_num is normally small, but it can exceed 2000 when a wrong 
field is accidentally used as the partition field, which makes 
files_num >= 1000 * 2000 = 2,000,000.
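
One possible mitigation (a sketch only, not a fix attached to this issue; it 
assumes the dynamic partition column is dt as in the example above, and xxx 
stands for whatever columns the original query selects) is to lower the shuffle 
parallelism for the insert, or to cluster the output by the partition column so 
each Hive partition is written by one task instead of by every shuffle partition:

    -- option 1: reduce shuffle partitions for this session before the insert
    set spark.sql.shuffle.partitions=200;

    -- option 2: repartition by the dynamic partition column, so each dt value
    -- is written by (roughly) one task and produces about one file
    insert overwrite table $output_table partition(dt)
    select xxx from tmp_table
    distribute by dt;

Note that distribute by dt trades file count for potential skew: a hot dt value 
ends up in a single large file written by a single task.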

> spark sql insert hive table write too many files
> ------------------------------------------------
>
>                 Key: SPARK-14974
>                 URL: https://issues.apache.org/jira/browse/SPARK-14974
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: zenglinxi
>
> Recently, we often encounter problems when using Spark SQL to insert data into 
> a partitioned Hive table (e.g.: insert overwrite table $output_table partition(dt) 
> select xxx from tmp_table).
> After the Spark job starts running on YARN, the application can create far too 
> many files (e.g. 2 million+, or even 10 million+), which puts HDFS under 
> enormous pressure.
> We found that the number of files created by the Spark job depends on the number 
> of partitions of the Hive table being inserted into and the number of Spark SQL 
> shuffle partitions:
> files_num = hive_table_partitions_num * spark_sql_partitions_num
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
> and hive_table_partitions_num is normally small, but it can exceed 2000 when a 
> wrong field is accidentally used as the partition field, which makes 
> files_num >= 1000 * 2000 = 2,000,000.



