zenglinxi created SPARK-20240:
---------------------------------

             Summary: SparkSQL support limitations of max dynamic partitions when inserting hive table
                 Key: SPARK-20240
                 URL: https://issues.apache.org/jira/browse/SPARK-20240
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.0, 1.6.2
            Reporter: zenglinxi
We found that HDFS problems sometimes occur when a user makes a typo in their code while using SparkSQL to insert data into a partitioned table. For example:

create table:
{quote}
create table test_tb (
  price double
) PARTITIONED BY (day_partition string, hour_partition string);
{quote}
normal sql for inserting into the table:
{quote}
insert overwrite table test_tb partition(day_partition, hour_partition)
select price, day_partition, hour_partition from other_table;
{quote}
sql with a typo (select columns in the wrong order):
{quote}
insert overwrite table test_tb partition(day_partition, hour_partition)
select hour_partition, day_partition, price from other_table;
{quote}
Because dynamic partition columns are bound by position rather than by name, this typo makes SparkSQL take the "price" column as "hour_partition", which may create millions of HDFS files in a short time if "other_table" holds a lot of data with a wide range of "price" values, severely degrading NameNode RPC performance. We think it's a good idea to limit the maximum number of files each task is allowed to create, to protect the HDFS NameNode from this kind of inadvertent error.
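For reference, Hive itself already guards against runaway dynamic partition inserts with a few session settings, and a Spark-side limit could mirror them. A minimal sketch of those settings follows (the keys are Hive's; the values shown are only illustrative, and SparkSQL would need to honor or replicate these checks on its own write path):

{quote}
-- Hive's existing safeguards for dynamic partition inserts (illustrative values):
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- cap on dynamic partitions created by the whole statement
SET hive.exec.max.dynamic.partitions = 1000;
-- cap on dynamic partitions created by any single node/task
SET hive.exec.max.dynamic.partitions.pernode = 100;
-- cap on the total number of files the statement may create
SET hive.exec.max.created.files = 100000;
{quote}

With limits like these enforced, the typo above would fail fast with an error instead of silently flooding the NameNode with millions of small files.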