[jira] [Commented] (SPARK-14566) When appending to partitioned persisted table, we should apply a projection over input query plan using existing metastore schema

2016-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237721#comment-15237721
 ] 

Apache Spark commented on SPARK-14566:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12179

> When appending to partitioned persisted table, we should apply a projection 
> over input query plan using existing metastore schema
> -
>
> Key: SPARK-14566
> URL: https://issues.apache.org/jira/browse/SPARK-14566
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Take the following snippets slightly modified from test case 
> "SQLQuerySuite.SPARK-11453: append data to partitioned table" as an example:
> {code}
> val df1 = Seq("1" -> "10", "2" -> "20").toDF("i", "j")
> df1.write.partitionBy("i").saveAsTable("tbl11453")
> val df2 = Seq("3" -> "30").toDF("i", "j")
> df2.write.mode(SaveMode.Append).partitionBy("i").saveAsTable("tbl11453")
> {code}
> Although {{df1.schema}} is {{<i, j>}}, the schema of the persisted 
> table {{tbl11453}} is actually {{<j, i>}} because {{i}} is a 
> partition column, which is always appended after all data columns. Thus, when 
> appending {{df2}}, the schemata of {{df2}} and the persisted table {{tbl11453}} 
> actually differ.
> In the current master branch, {{CreateMetastoreDataSourceAsSelect}} simply 
> applies the existing metastore schema to the input query plan ([see 
> here|https://github.com/apache/spark/blob/75e05a5a964c9585dd09a2ef6178881929bab1f1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala#L225]),
>  which is wrong. A projection should be used instead to adjust the column order 
> here.
> In branch-1.6, [this projection is added in 
> {{InsertIntoHadoopFsRelation}}|https://github.com/apache/spark/blob/663a492f0651d757ea8e5aeb42107e2ece429613/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L99-L104],
>  but was removed in Spark 2.0. Replacing the aforementioned line in 
> {{CreateMetastoreDataSourceAsSelect}} with a projection would be 
> preferable.
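The column-order mismatch described above can be illustrated with a small, Spark-free Scala sketch. The {{Column}} case class and the concrete names here are hypothetical stand-ins for Spark's internal attributes, not Spark's actual API; the point is that a name-based projection reorders the input columns to the table's layout, whereas blindly applying the metastore schema pairs values with the wrong column names.

```scala
// Hypothetical model of a query plan's output columns (not Spark's API).
case class Column(name: String, values: Seq[String])

object ProjectionSketch extends App {
  // df2's output columns in DataFrame order: (i, j)
  val input = Seq(Column("i", Seq("3")), Column("j", Seq("30")))

  // Metastore schema order for the partitioned table: (j, i),
  // because partition column "i" is appended after all data columns.
  val metastoreOrder = Seq("j", "i")

  // Wrong approach: keep positions, overwrite names with the metastore
  // schema. Column "j" now holds i's values.
  val renamed = input.zip(metastoreOrder).map { case (col, name) =>
    col.copy(name = name)
  }
  assert(renamed.head == Column("j", Seq("3"))) // data/name mismatch

  // Projection approach: look each metastore column up by name, so
  // values stay attached to the right column.
  val projected = metastoreOrder.map(name => input.find(_.name == name).get)
  assert(projected.map(_.name) == Seq("j", "i"))
  assert(projected.head.values == Seq("30")) // "j" keeps j's values
}
```

In Spark terms, the projection step corresponds to wrapping the input query plan in a {{Project}} node that selects the input attributes in the metastore schema's order, which is what the branch-1.6 code linked above did.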



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14566) When appending to partitioned persisted table, we should apply a projection over input query plan using existing metastore schema

2016-04-12 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237690#comment-15237690
 ] 

Cheng Lian commented on SPARK-14566:


This bug is exposed after fixing SPARK-14458.

Together, these two bugs happened to cheat all our existing test cases.
