[ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
dongxu updated SPARK-6644: -------------------------- Description: In hive,the schema of partition may be difference from the table schema. For example, we add new column. When we use spark-sql to query the data of partition which schema is difference from the table schema. some problems is solved(https://github.com/apache/spark/pull/4289), but if you add a new column,put new data into the old partition,new column value is NULL [According to the following steps]: case class TestData(key: Int, value: String) val testData = TestHive.sparkContext.parallelize((1 to 10).map(i => TestData(i, i.toString))).toDF() testData.registerTempTable("testData") sql("DROP TABLE IF EXISTS table_with_partition ") sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int,value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}' ") sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='1') SELECT key,value FROM testData") // add column to table sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)") sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)") sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='1') SELECT key,value,'test',1.11 FROM testData") sql("select * from table_with_partition where ds='1' ").collect().foreach(println) result : [1,1,null,null,1] [2,2,null,null,1] result we expect: [1,1,test,1.11,1] [2,2,test,1.11,1] This bug will cause the wrong queyr number ,when we query like that: select count(1) from table_with_partition where key1 is not NULL was: In hive,the schema of partition may be difference from the table schema. For example, we add new column. When we use spark-sql to query the data of partition which schema is difference from the table schema. some problems is solved(https://github.com/apache/spark/pull/4289), but if you add a new column,put new data into the old partition,new column value is NULL [According to the following steps]: case class TestData(key: Int, value: String) val testData = TestHive.sparkContext.parallelize( (1 to 10).map(i => TestData(i, i.toString))).toDF() testData.registerTempTable("testData") sql("DROP TABLE IF EXISTS table_with_partition ") sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int,value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}' ") sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='1') SELECT key,value FROM testData") // add column to table sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)") sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)") sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='1') SELECT key,value,'test',1.11 FROM testData") sql("select * from table_with_partition where ds='1' ").collect().foreach(println) result : [1,1,null,null,1] [2,2,null,null,1] result we expect: [1,1,test,1.11,1] [2,2,test,1.11,1] > [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), > new column value is NULL > ------------------------------------------------------------------------------------------------------ > > Key: SPARK-6644 > URL: https://issues.apache.org/jira/browse/SPARK-6644 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.3.0 > Reporter: dongxu > > In hive,the schema of partition may be difference from the table schema. For > example, we add new column. When we use spark-sql to query the data of > partition which schema is difference from the table schema. > some problems is solved(https://github.com/apache/spark/pull/4289), > but if you add a new column,put new data into the old partition,new column > value is NULL > [According to the following steps]: > case class TestData(key: Int, value: String) > val testData = TestHive.sparkContext.parallelize((1 to 10).map(i => > TestData(i, i.toString))).toDF() > testData.registerTempTable("testData") > sql("DROP TABLE IF EXISTS table_with_partition ") > sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int,value > string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}' ") > sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='1') SELECT > key,value FROM testData") > // add column to table > sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)") > sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)") > sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='1') SELECT > key,value,'test',1.11 FROM testData") > sql("select * from table_with_partition where ds='1' > ").collect().foreach(println) > > result : > [1,1,null,null,1] > [2,2,null,null,1] > > result we expect: > [1,1,test,1.11,1] > [2,2,test,1.11,1] > This bug will cause the wrong queyr number ,when we query like that: > select count(1) from table_with_partition where key1 is not NULL -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org