zenglinxi created SPARK-20240:
---------------------------------

             Summary: SparkSQL support limitations of max dynamic partitions when inserting hive table
                 Key: SPARK-20240
                 URL: https://issues.apache.org/jira/browse/SPARK-20240
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.0, 1.6.2
            Reporter: zenglinxi


We found that an HDFS problem sometimes occurs when a user makes a typo in their code while using SparkSQL to insert data into a partitioned table.
For example:
create table:
 {quote}
 create table test_tb (
   price double
 ) PARTITIONED BY (day_partition string, hour_partition string)
 {quote}

normal SQL for inserting into the table:
 {quote}
insert overwrite table test_tb partition(day_partition, hour_partition) select 
price, day_partition, hour_partition from other_table;
 {quote}

SQL with a typo:
 {quote}
insert overwrite table test_tb partition(day_partition, hour_partition) select 
hour_partition, day_partition, price from other_table;
 {quote}
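Dynamic partition columns are bound to the trailing SELECT columns by position, not by name, so the last column here, "price", is silently written as "hour_partition". Under a default Hive warehouse layout (the paths below are illustrative, not taken from an actual run), every distinct price value then becomes its own partition directory:
 {quote}
/user/hive/warehouse/test_tb/day_partition=20170406/hour_partition=1.99/part-00000
/user/hive/warehouse/test_tb/day_partition=20170406/hour_partition=2.49/part-00000
...
 {quote}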
This typo makes SparkSQL take the "price" column as "hour_partition", which can create millions of HDFS files in a short time if "other_table" holds a large amount of data with a wide range of "price" values, severely degrading NameNode RPC performance.
We think it's a good idea to limit the maximum number of files each task is allowed to create, to protect the HDFS NameNode from such inadvertent errors.
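For comparison, Hive itself guards against runaway dynamic partitioning with settings such as the following (these are existing Hive properties; the values shown are Hive's defaults):
 {quote}
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=100;
SET hive.exec.max.created.files=100000;
 {quote}
SparkSQL does not appear to enforce comparable limits when writing into a Hive table, so an analogous Spark-side cap on files created per task would close this gap.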



