[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL
[ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-6644:
------------------------------
    Description:

In Hive, the schema of a partition may differ from the table schema. For example, we may add new columns to the table after importing existing partitions. When using {{spark-sql}} to query the data in a partition whose schema differs from the table schema, problems may arise. Part of them have been solved in [PR #4289|https://github.com/apache/spark/pull/4289].

However, after adding new column(s) to the table, when inserting data into old partitions, values of the newly added columns are all {{NULL}}. The following snippet can be used to reproduce this issue:

{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext
  .parallelize((1 to 2).map(i => TestData(i, i.toString)))
  .toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(
  s"""CREATE TABLE IF NOT EXISTS table_with_partition (key int, value string)
     |PARTITIONED BY (ds string)
     |LOCATION '${tmpDir.toURI.toString}'""".stripMargin)
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng double)")

sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")
sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}

Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}

Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}

This bug also produces wrong counts for queries such as {{SELECT count(1) FROM table_with_partition WHERE key1 IS NOT NULL}}.

was: In Hive, the schema of a partition may differ from the table schema, for example after adding a new column. Some problems were solved in [PR #4289|https://github.com/apache/spark/pull/4289], but after adding new columns and inserting data into an old partition, the new column values are NULL (reproduction steps and actual/expected results as above). This causes wrong counts for queries such as: select count(1) from table_with_partition where key1 is not NULL

> [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL
> ------------------------------------------------------------------------------------------------------
>                 Key: SPARK-6644
>                 URL: https://issues.apache.org/jira/browse/SPARK-6644
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: dongxu
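The failure mode can be illustrated without Spark. The sketch below is a hypothetical simulation of the suspected mechanism, not Spark's actual code path: `read_with_schema` is an invented helper that resolves each table column against a recorded schema by name, so any column missing from a stale partition schema comes back as None, mirroring the NULLs in the report.

```python
def read_with_schema(row, stored_schema, table_schema):
    """Resolve a row (values in stored_schema order) against table_schema.

    Columns present in table_schema but absent from stored_schema are
    filled with None, which is how the NULLs in the bug report arise.
    """
    by_name = dict(zip(stored_schema, row))
    return [by_name.get(col) for col in table_schema]


# Table schema after ALTER TABLE ... ADD COLUMNS (key1), (destlng):
table_schema = ["key", "value", "key1", "destlng"]
# Schema recorded for partition ds='1' before the ALTER TABLE:
stale_partition_schema = ["key", "value"]
# What the second INSERT OVERWRITE actually wrote for one row:
row_on_disk = [1, "1", "test", 1.11]

# Buggy behavior: resolving against the stale partition schema
# loses the newly added columns.
print(read_with_schema(row_on_disk[:2], stale_partition_schema, table_schema))
# Expected behavior: resolving against the schema the data was
# actually written with preserves them.
print(read_with_schema(row_on_disk, table_schema, table_schema))
```

Under this (assumed) model, the fix is to read each partition with the schema its data was written with and reconcile it with the table schema, rather than trusting the partition's stale metadata.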
--
This message was sent by Atlassian JIRA (v6.3.4#6332)