[ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-6644:
--------------------------
    Description: 
In Hive, the schema of a partition may differ from the table schema, for example 
after a new column is added to the table. Some of the problems with querying such 
partitions through Spark SQL were already solved 
(https://github.com/apache/spark/pull/4289), but if you add a new column and then 
insert new data into an existing partition, the new column's values come back as 
NULL.

Steps to reproduce:

// assumes the Spark Hive test context is in scope, e.g.
//   import org.apache.spark.sql.hive.test.TestHive
//   import org.apache.spark.sql.hive.test.TestHive._
//   import org.apache.spark.sql.hive.test.TestHive.implicits._
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext
  .parallelize((1 to 10).map(i => TestData(i, i.toString)))
  .toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")

// tmpDir is a temporary directory created elsewhere in the test
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') " +
  "SELECT key, value FROM testData")

// add new columns to the table, then rewrite the existing partition with values for them
sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') " +
  "SELECT key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds='1'").collect().foreach(println)
 
Actual result:
[1,1,null,null,1]
[2,2,null,null,1]

Expected result:
[1,1,test,1.11,1]
[2,2,test,1.11,1]
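
The new values appear to be written to the partition's data files and lost only on 
the Spark SQL read path. A quick way to check the files directly (a sketch, assuming 
the default text storage format of the CREATE TABLE above and the same tmpDir):

TestHive.sparkContext
  .textFile(new java.io.File(tmpDir, "ds=1").toURI.toString)
  .take(2)
  .foreach(println)
// each line should carry all four columns, e.g. 1^A1^Atest^A1.11,
// where ^A (\u0001) is Hive's default field delimiter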

This bug also causes wrong query counts when filtering on the new column, for example:

select count(1) from table_with_partition where key1 is not NULL
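
A minimal sketch of the miscount, assuming the reproduction steps above have been run 
(the counts below follow from the NULL results reported above, not from a separate 
measurement):

// all 10 rows were rewritten with key1 = 'test', so the correct count is 10,
// but because Spark SQL returns NULL for key1 here, the query yields 0
val nonNullKey1 = sql(
  "SELECT count(1) FROM table_with_partition WHERE key1 IS NOT NULL"
).collect().head.getLong(0)
println(s"rows with non-null key1: $nonNullKey1")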

> [SPARK-SQL] when the partition schema does not match the table schema (ADD 
> COLUMN), new column values are NULL
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6644
>                 URL: https://issues.apache.org/jira/browse/SPARK-6644
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: dongxu
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
