[
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cheng Lian updated SPARK-6644:
--
Description:
In Hive, the schema of a partition may differ from the table schema. For
example, we may add new columns to the table after importing existing
partitions. When using {{spark-sql}} to query the data in a partition whose
schema is different from the table schema, problems may arise. Some of these
problems were addressed in [PR #4289|https://github.com/apache/spark/pull/4289].
However, after new columns are added to the table, inserting data into an old
partition still leaves all values of the newly added columns {{NULL}}.
The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(
  s"""CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING)
     |PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'
   """.stripMargin)
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)")
sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}
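The symptom is consistent with the INSERT path resolving output columns against the partition's stored (pre-{{ALTER TABLE}}) schema rather than the current table schema. The following self-contained Scala sketch illustrates that failure mode in miniature; the object and method names are invented for illustration and do not correspond to Spark internals:

```scala
// Illustrative sketch only, not Spark's actual code: if a write resolves
// columns against a stale partition schema instead of the table schema,
// values for newly added columns are silently dropped and read back as NULL.
object SchemaMismatchSketch {
  val tableSchema: Seq[String] = Seq("key", "value", "key1", "destlng")
  val partitionSchema: Seq[String] = Seq("key", "value") // created before ADD COLUMNS

  // Store a row using writeSchema, then read it back padded to readSchema.
  def insertRow(row: Map[String, Any],
                writeSchema: Seq[String],
                readSchema: Seq[String]): Seq[Any] = {
    val stored = writeSchema.map(c => c -> row(c)).toMap // columns outside writeSchema are lost
    readSchema.map(c => stored.getOrElse(c, null))       // missing columns come back as NULL
  }

  def main(args: Array[String]): Unit = {
    val row = Map[String, Any]("key" -> 1, "value" -> "1", "key1" -> "test", "destlng" -> 1.11)
    // Buggy path: the stale partition schema drives the write.
    println(insertRow(row, partitionSchema, tableSchema)) // List(1, 1, null, null)
    // Expected path: the current table schema drives the write.
    println(insertRow(row, tableSchema, tableSchema))     // List(1, 1, test, 1.11)
  }
}
```

Under this reading, the fix would be to resolve the INSERT's output columns against the table schema and let the reader pad old files with {{NULL}} only for rows written before the columns existed.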
was:
In Hive, the schema of a partition may differ from the table schema. For
example, we may add new columns to the table after importing existing
partitions. When using {{spark-sql}} to query the data in a partition whose
schema is different from the table schema, problems may arise. Some of these
problems were addressed in [PR #4289|https://github.com/apache/spark/pull/4289].
However, after new columns are added to the table, inserting data into an old
partition still leaves all values of the newly added columns {{NULL}}.
The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(
  s"""CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string)
     |PARTITIONED by (ds string) location '${tmpDir.toURI.toString}'
   """.stripMargin)
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}
After adding new columns to a partitioned table and inserting data to an old
partition, data of newly added columns are all NULL
Key: SPARK-6644
URL: https://issues.apache.org/jira/browse/SPARK-6644
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu