[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267576#comment-15267576
 ] 

Xin Wu edited comment on SPARK-14927 at 5/2/16 10:01 PM:
---------------------------------------------------------

Right now, when a datasource table is created with partitions, it is not a 
Hive-compatible table.
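For example, in the spark-shell (a minimal sketch; the table name tmp.dstest is 
made up for illustration), a dataframe written with partitionBy + saveAsTable 
ends up as a Spark-specific datasource table whose partitions Hive does not see:
{code}
scala> Seq((2012, "a")).toDF("year", "val").write.partitionBy("year").saveAsTable("tmp.dstest")

scala> spark.sql("show partitions tmp.dstest").show
// fails with an error like "Table tmp.dstest is not a partitioned table",
// the same failure the reporter hits below
{code}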

So you may need to create the table with Hive DDL, like {code}create table tmp.tmp1 (val string) 
partitioned by (year int) stored as parquet location '....' {code}
Then insert into the table from a temp table derived from the dataframe. Here 
is something I tried:
{code}
scala> df.show
+----+---+
|year|val|
+----+---+
|2012|  a|
|2013|  b|
|2014|  c|
+----+---+

scala> val df1 = spark.sql("select * from t000 where year = 2012")
df1: org.apache.spark.sql.DataFrame = [year: int, val: string]

scala> df1.registerTempTable("df1")

scala> spark.sql("insert into tmp.ptest3 partition(year=2012) select * from 
df1")

scala> val df2 = spark.sql("select * from t000 where year = 2013")
df2: org.apache.spark.sql.DataFrame = [year: int, val: string]

scala> df2.registerTempTable("df2")

scala> spark.sql("insert into tmp.ptest3 partition(year=2013) select val from 
df2")
16/05/02 14:47:34 WARN log: Updating partition stats fast for: ptest3
16/05/02 14:47:34 WARN log: Updated size to 327
res54: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show partitions tmp.ptest3").show
+---------+
|   result|
+---------+
|year=2012|
|year=2013|
+---------+

{code}

This is a bit hacky, though; there should be a better solution for your problem. 
Also, this was done on Spark 2.0, so check whether 1.6 accepts the same approach. 
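If dynamic partition insert works in your environment, you could also skip the 
per-partition temp tables entirely. A sketch only (not verified on 1.6; it 
reuses t000 and tmp.ptest3 from above, and the two confs are the ones commented 
out in your snippet):
{code}
scala> spark.sql("set hive.exec.dynamic.partition=true")
scala> spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

scala> // with a dynamic partition spec, the partition column goes last in the select list
scala> spark.sql("insert into tmp.ptest3 partition(year) select val, year from t000")
{code}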


> DataFrame.saveAsTable creates RDD partitions but not Hive partitions
> ---------------------------------------------------------------------
>
>                 Key: SPARK-14927
>                 URL: https://issues.apache.org/jira/browse/SPARK-14927
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.1
>         Environment: Mac OS X 10.11.4 local
>            Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use the suggestions in the answers but couldn't make them work 
> in Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
>     hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
>     //    hc.setConf("hive.exec.dynamic.partition", "true")
>     //    hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
>     hc.sql("create database if not exists tmp")
>     hc.sql("drop table if exists tmp.partitiontest1")
>     Seq(2012 -> "a").toDF("year", "val")
>       .write
>       .partitionBy("year")
>       .mode(SaveMode.Append)
>       .saveAsTable("tmp.partitiontest1")
>     hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
>     ======================
>     HIVE FAILURE OUTPUT
>     ======================
>     SET hive.support.sql11.reserved.keywords=false
>     SET hive.metastore.warehouse.dir=tmp/tests
>     OK
>     OK
>     FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
>     ======================
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates the table with empty partitions.
> Any help to move this forward is appreciated.


