[jira] [Commented] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL

Xiu(Joe) Guo (JIRA) Thu, 26 Nov 2015 12:07:45 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029226#comment-15029226
 ]


Xiu(Joe) Guo commented on SPARK-6644:
-------------------------------------

On the current master branch (1.6.0-SNAPSHOT), this issue can no longer be 
reproduced.

{panel}
scala> sqlContext.sql("DROP TABLE IF EXISTS table_with_partition ")
res6: org.apache.spark.sql.DataFrame = []

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING)")
res7: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")
res8: org.apache.spark.sql.DataFrame = []

scala> sqlContext.sql("select * from table_with_partition")
res9: org.apache.spark.sql.DataFrame = [key: int, value: string, ds: string]

scala> sqlContext.sql("select * from table_with_partition").show
|key|value| ds|
|  1|    1|  1|
|  2|    2|  1|

scala> sqlContext.sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)")
res11: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)")
res12: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")
res13: org.apache.spark.sql.DataFrame = []

scala> sqlContext.sql("SELECT * FROM table_with_partition").show
|key|value|key1|destlng| ds|
|  1|    1|test|   1.11|  1|
|  2|    2|test|   1.11|  1|
{panel}
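
For readers comparing the reported result with the session above, the failure mode can be pictured with a toy model. This is only a sketch under an assumption: that the old partition kept its pre-ALTER column list, so values written for the new columns were dropped on read and padded back as NULL. It uses plain Scala collections, not Spark or Hive APIs, and the names `PartitionSchemaSketch`, `read`, `staleRead` and `freshRead` are hypothetical. The partition column {{ds}} is left out for brevity.

```scala
// Toy model only: plain Scala collections, no Spark or Hive APIs involved.
object PartitionSchemaSketch {
  type Row = Map[String, Any]

  // Read one stored row through a partition-level column list, then widen it
  // to the table-level column list. Columns the partition-level list does not
  // contain are dropped and come back as null.
  def read(tableCols: Seq[String], partitionCols: Seq[String], stored: Row): Seq[Any] = {
    val visible = stored.filter { case (name, _) => partitionCols.contains(name) }
    tableCols.map(name => visible.getOrElse(name, null))
  }

  // Table schema after the two ALTER TABLE ... ADD COLUMNS statements.
  val tableCols: Seq[String] = Seq("key", "value", "key1", "destlng")

  // One row as written by the second INSERT OVERWRITE.
  val stored: Row = Map("key" -> 1, "value" -> "1", "key1" -> "test", "destlng" -> 1.11)

  // Assumed cause of the report: partition column list left at its old state.
  val staleRead: Seq[Any] = read(tableCols, Seq("key", "value"), stored)

  // Partition column list in step with the table schema.
  val freshRead: Seq[Any] = read(tableCols, tableCols, stored)

  def main(args: Array[String]): Unit = {
    println(staleRead.mkString("[", ",", "]"))
    println(freshRead.mkString("[", ",", "]"))
  }
}
```

In this model the stale read yields nulls for {{key1}} and {{destlng}}, matching the shape of the "Actual result" in the description, while the fresh read returns the inserted values.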

> After adding new columns to a partitioned table and inserting data to an old 
> partition, data of newly added columns are all NULL
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6644
>                 URL: https://issues.apache.org/jira/browse/SPARK-6644
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: dongxu
>
> In Hive, the schema of a partition may differ from the table schema. For 
> example, we may add new columns to the table after importing existing 
> partitions. When using {{spark-sql}} to query the data in a partition whose 
> schema is different from the table schema, problems may arise. Some of them 
> have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
> However, after adding new column(s) to the table and then inserting data into 
> old partitions, the values of the newly added columns are all {{NULL}}.
> The following snippet can be used to reproduce this issue:
> {code}
> case class TestData(key: Int, value: String)
> val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF()
> testData.registerTempTable("testData")
> sql("DROP TABLE IF EXISTS table_with_partition ")
> sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'")
> sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")
> // Add new columns to the table
> sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)")
> sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)")
> sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")
> sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
> {code}
> Actual result:
> {noformat}
> [1,1,null,null,1]
> [2,2,null,null,1]
> {noformat}
> Expected result:
> {noformat}
> [1,1,test,1.11,1]
> [2,2,test,1.11,1]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
