[ https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823103#comment-15823103 ]

Shuai Lin commented on SPARK-19153:
-----------------------------------

I find it quite straightforward to remove the partitioned-by restriction for the 
{{create table t1 using hive partitioned by (c1, c2) as select ...}} CTAS statement.
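
For concreteness, this is the kind of statement that would become possible once the 
restriction is lifted (the source table {{src}} and its columns are made up for 
illustration, not taken from the actual tests):
{code}
// hypothetical source table: src(c1 int, c2 string, c3 double)
scala> sql("create table t1 using hive partitioned by (c1, c2) as select c1, c2, c3 from src")
{code}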

But another problem comes up: the partition columns must be the rightmost columns of 
the schema; otherwise the schema we store in the table properties of the metastore 
(under the property key "spark.sql.sources.schema") would be inconsistent with the 
schema we read back through the hive client api.

The reason is that, when creating a hive table in the metastore, the schema and the 
partition columns are disjoint sets (as required by the hive client api). And when we 
read the table back, we append the partition columns to the end of the schema to get 
the catalyst schema, i.e.:
{code}
// HiveClientImpl.scala
// Hive keeps data columns and partition columns separately, so the catalyst
// schema is rebuilt by appending the partition columns at the end.
val partCols = h.getPartCols.asScala.map(fromHiveColumn)
val schema = StructType(h.getCols.asScala.map(fromHiveColumn) ++ partCols)
{code}
This was not a problem before we had the unified "create table" syntax, because the old 
create hive table syntax forces us to specify the normal columns and the partition 
columns separately, e.g. {{create table t1 (id int, name string) partitioned by (dept 
string)}}.

Now that we can create partitioned tables using the hive format, e.g. {{create table 
t1 (id int, name string, dept string) using hive partitioned by (name)}}, the partition 
columns may no longer be the last columns, so I think we need to reorder the schema so 
that the partition columns come last. This is consistent with data source tables, e.g.

{code}
scala> sql("create table t1 (id int, name string, dept string) using parquet partitioned by (name)")
scala> spark.table("t1").schema.fields.map(_.name)
res44: Array[String] = Array(id, dept, name)
{code}
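
A minimal sketch of the reordering I have in mind (the helper name and its placement are 
mine, just to illustrate the idea, not the actual Spark code):
{code}
import org.apache.spark.sql.types.StructType

// Move the partition columns to the end of the schema while keeping the
// relative order of the data columns; the partition columns keep the order
// in which they were declared in PARTITIONED BY. Hypothetical helper.
def reorderSchema(schema: StructType, partitionCols: Seq[String]): StructType = {
  val partSet = partitionCols.map(_.toLowerCase).toSet
  val (partFields, dataFields) =
    schema.fields.partition(f => partSet.contains(f.name.toLowerCase))
  val orderedPartFields =
    partitionCols.flatMap(p => partFields.find(_.name.equalsIgnoreCase(p)))
  StructType(dataFields ++ orderedPartFields)
}
{code}
That would make the hive CTAS path produce the same column ordering as the data source 
path shown above.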

[~cloud_fan] Does this sound good to you?


> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-19153
>                 URL: https://issues.apache.org/jira/browse/SPARK-19153
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Wenchen Fan
>



