[ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheng Lian updated SPARK-6644:
------------------------------
    Description: 
In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query a partition whose schema differs from the table schema, problems may arise. Some of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after new column(s) are added to the table, inserting data into an old partition leaves the values of the newly added columns all {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext
  .parallelize((1 to 2).map(i => TestData(i, i.toString)))
  .toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)")
sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)")

sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")
sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}

  was:
In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise.
Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")

sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")
sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}


> After adding new columns to a partitioned table and inserting data to an old
> partition, data of newly added columns are all NULL
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6644
>                 URL: https://issues.apache.org/jira/browse/SPARK-6644
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: dongxu
>
> In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions.
> When using {{spark-sql}} to query the data in a partition whose schema is different from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after adding new column(s) to the table, when inserting data into old partitions, values of newly added columns are all {{NULL}}.
> The following snippet can be used to reproduce this issue:
> {code}
> case class TestData(key: Int, value: String)
> val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF()
> testData.registerTempTable("testData")
> sql("DROP TABLE IF EXISTS table_with_partition")
> sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'")
> sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")
> // Add new columns to the table
> sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)")
> sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)")
> sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")
> sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
> {code}
> Actual result:
> {noformat}
> [1,1,null,null,1]
> [2,2,null,null,1]
> {noformat}
> Expected result:
> {noformat}
> [1,1,test,1.11,1]
> [2,2,test,1.11,1]
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
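The failure mode described above is a schema-on-read mismatch: the old partition keeps the schema it had before {{ALTER TABLE ... ADD COLUMNS}}, so values written for the new columns are lost against the stale partition schema and come back as {{NULL}} when read through the newer table schema. The following plain-Python sketch only models that mechanism for illustration; it is not Spark or Hive code, and all names in it are illustrative:

```python
# Illustrative model of the bug's mechanism (not Spark/Hive code).
# Data columns only; the partition column ds is handled separately by Hive.
table_schema = ["key", "value", "key1", "destlng"]   # after ALTER TABLE ... ADD COLUMNS
partition_schema = ["key", "value"]                  # stale: created before the ALTER

def write_row(values, schema):
    # The writer pairs values with the partition's (stale) schema;
    # zip() silently drops the values that have no matching column.
    return dict(zip(schema, values))

def read_row(stored, schema):
    # The reader projects the stored row onto the table schema,
    # filling missing columns with None (Hive's NULL).
    return [stored.get(col) for col in schema]

# INSERT OVERWRITE ... SELECT key, value, 'test', 1.11:
stored = write_row([1, "1", "test", 1.11], partition_schema)
print(read_row(stored, table_schema))  # [1, '1', None, None]
```

Under this model, 'test' and 1.11 never reach storage at all, which matches the observed output [1,1,null,null,1] once the ds partition value is appended.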