[jira] [Comment Edited] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2016-10-27 Thread Raul Saez Tapia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15610900#comment-15610900
 ] 

Raul Saez Tapia edited comment on SPARK-14927 at 10/27/16 7:36 AM:
---

[~xwu0226] your example works fine for me with Spark 1.6.1. However, it does 
not work when we use a UDT.

My DataFrame shows:
{code}
scala> model_date.toDF.show
+--------+--------------------+
|    date|               model|
+--------+--------------------+
|20160610|[aa.bb.spark.types.PersonWrapper@8542...|
|20160610|[aa.bb.spark.types.PersonWrapper@8831..|
...
...
+--------+--------------------+
{code}

I have created the table with some specific properties so that I can control how 
the table is defined and how the PersonType UDT maps to the table schema:
{code:sql}
create table model_orc (`model` 
struct<person:struct<id:int,name:string>>) PARTITIONED BY (`date` 
string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' WITH 
SERDEPROPERTIES ('path'='hdfs:///user/raulsaez/model_orc') STORED AS 
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 
'hdfs:///user/raulsaez/model_orc' 
TBLPROPERTIES('spark.sql.sources.schema.numParts'='1','spark.sql.sources.schema.part.0'='{
 \"type\":\"struct\",\"fields\":[{ \"name\":\"personWrapper\",\"type\":{ 
\"type\":\"udt\",\"class\":\"aa.bb.spark.types.PersonType\",\"pyClass\":null,\"sqlType\":{
 \"type\":\"struct\",\"fields\":[{ \"name\":\"id\",\"type\":   
\"integer\",\"nullable\":true,\"metadata\":{} } ,{ 
\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{} }] } 
},\"nullable\":true,\"metadata\":{} }] }')
{code}

Now we insert data into the table:
{code}
scala> hiveContext.sql("insert into model_orc partition(date=20160610) select model,date from dfJune")
org.apache.spark.sql.AnalysisException: cannot resolve 'cast(model as struct<person:struct<id:int,name:string>>)'
due to data type mismatch: cannot cast
StructType(StructField(personWrapper,,true)),true)
to
StructType(StructField(person,StructType(StructField(id,IntegerType,true),StructField(name,StringType,true)),true))
{code}
I have the same issue with both Parquet and ORC.
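
The thread never shows the UDT itself, so for context, here is a minimal sketch of what a PersonType UDT typically looks like against the Spark 1.6 `UserDefinedType` API. The class and field names are assumptions mirroring the `aa.bb.spark.types.PersonType` in the logs; the real implementation may differ.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical user class; the annotation binds it to the UDT.
@SQLUserDefinedType(udt = classOf[PersonType])
case class Person(id: Int, name: String)

class PersonType extends UserDefinedType[Person] {
  // sqlType must match the struct<id:int,name:string> recorded in the
  // spark.sql.sources.schema.part.0 table property shown above.
  override def sqlType: DataType = StructType(Seq(
    StructField("id", IntegerType, nullable = true),
    StructField("name", StringType, nullable = true)))

  // Convert the user object into Catalyst's internal row representation.
  override def serialize(obj: Any): Any = obj match {
    case Person(id, name) =>
      val row = new GenericMutableRow(2)
      row.setInt(0, id)
      row.update(1, UTF8String.fromString(name))
      row
  }

  // Convert the internal representation back into the user object.
  override def deserialize(datum: Any): Person = datum match {
    case row: InternalRow => Person(row.getInt(0), row.getString(1))
  }

  override def userClass: Class[Person] = classOf[Person]
}
```

The ClassCastException below suggests the Hive ORC writer path receives the UDT itself where it expects the UDT's sqlType struct.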



And if I persist the DataFrame as a table with ORC:
{code}
model_date.toDF.write.format("orc").partitionBy("date").saveAsTable("model_orc_asTable")
{code}

Or even if I persist it as an ORC file:
{code}
scala> model_date.toDF.write.mode(SaveMode.Append).format("orc").partitionBy("date").save("model_orc")
{code}

I get the ClassCastException:
{code}
Caused by: java.lang.ClassCastException: aa.bb.spark.types.PersonType cannot be cast to org.apache.spark.sql.types.StructType
	at org.apache.spark.sql.hive.HiveInspectors$class.wrap(HiveInspectors.scala:557)
	at org.apache.spark.sql.hive.orc.OrcOutputWriter.wrap(OrcRelation.scala:66)
	at org.apache.spark.sql.hive.HiveInspectors$class.wrap(HiveInspectors.scala:568)
	at org.apache.spark.sql.hive.orc.OrcOutputWriter.wrap(OrcRelation.scala:66)
	at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrap$1.apply(HiveInspectors.scala:590)
	at org.apache.spark.sql.hive.HiveInspectors$$anonfun$wrap$1.apply(HiveInspectors.scala:589)
	at org.apache.spark.sql.catalyst.util.ArrayData.foreach(ArrayData.scala:135)
	at org.apache.spark.sql.hive.HiveInspectors$class.wrap(HiveInspectors.scala:589)
	at org.apache.spark.sql.hive.orc.OrcOutputWriter.wrap(OrcRelation.scala:66)
	at org.apache.spark.sql.hive.HiveInspectors$class.wrap(HiveInspectors.scala:568)
	at org.apache.spark.sql.hive.orc.OrcOutputWriter.wrap(OrcRelation.scala:66)
	at org.apache.spark.sql.hive.orc.OrcOutputWriter.wrapOrcStruct(OrcRelation.scala:128)
	at org.apache.spark.sql.hive.orc.OrcOutputWriter.writeInternal(OrcRelation.scala:139)
	at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:358)
	... 8 more
{code}


If I persist the DataFrame as a table with Parquet:
{code}
scala> model_date.toDF.write.mode(SaveMode.Append).format("parquet").partitionBy("date").saveAsTable("model_parquet_asTable")
16/10/27 09:39:24 WARN HiveContext$$anon$2: Persisting partitioned data source 
relation `model_parquet_asTable` into Hive metastore in Spark SQL specific 
format, which is NOT compatible with Hive. Input path(s):
hdfs://dev-nameservice/apps/hive/warehouse/model_parquet_astable
...
scala> hiveContext.sql("select * from model_parquet_asTable where date=20160610").show
+--------------------+--------+
|               model|    date|
+--------------------+--------+
|[aa.bb.spark.types.PersonWrapper@8542...|20160610|
|[aa.bb.spark.types.PersonWrapper@8831...|20160610|
|[aa.bb.spark.types.PersonWrapper@3661...|20160610|
...
+--------------------+--------+
only showing top 20 rows
{code}
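
The WARN above is the heart of SPARK-14927: saveAsTable records the partitioning only in Spark-specific table properties, so the Hive metastore gets no partition entries. A quick way to confirm, assuming the table above exists (a sketch, not output from the thread):

```scala
// Spark itself can filter on the partition column just fine...
hiveContext.sql(
  "select count(*) from model_parquet_asTable where date=20160610").show()

// ...but asking the metastore for partitions fails with
// "Table ... is not a partitioned table", as reported in the issue,
// because saveAsTable registered no Hive partitions.
hiveContext.sql("show partitions model_parquet_asTable").show()
```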


[jira] [Comment Edited] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2016-08-25 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438457#comment-15438457
 ] 

Xin Wu edited comment on SPARK-14927 at 8/26/16 4:46 AM:
-

[~smilegator] Do you think what you are working on will fix this issue as well? 
The goal is to allow Hive to see the partitions created by Spark SQL from a 
DataFrame. 



> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use suggestions in the answers but couldn't make it to work in 
> Spark 1.6.1
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates table with empty partitions.
> Any help to move this forward is appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2016-05-02 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267576#comment-15267576
 ] 

Xin Wu edited comment on SPARK-14927 at 5/2/16 10:01 PM:
-

Right now, when a data source table is created with partitions, it is not a 
Hive-compatible table. 

So you may need to create the table like {code}create table tmp.tmp1 (val string) 
partitioned by (year int) stored as parquet location '' {code}
Then insert into the table via a temp table derived from the DataFrame. Here is 
something I tried:
{code}
scala> df.show
+----+---+
|year|val|
+----+---+
|2012|  a|
|2013|  b|
|2014|  c|
+----+---+

scala> val df1 = spark.sql("select * from t000 where year = 2012")
df1: org.apache.spark.sql.DataFrame = [year: int, val: string]

scala> df1.registerTempTable("df1")

scala> spark.sql("insert into tmp.ptest3 partition(year=2012) select * from 
df1")

scala> val df2 = spark.sql("select * from t000 where year = 2013")
df2: org.apache.spark.sql.DataFrame = [year: int, val: string]

scala> df2.registerTempTable("df2")

scala> spark.sql("insert into tmp.ptest3 partition(year=2013) select val from 
df2")
16/05/02 14:47:34 WARN log: Updating partition stats fast for: ptest3
16/05/02 14:47:34 WARN log: Updated size to 327
res54: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show partitions tmp.ptest3").show
+---------+
|   result|
+---------+
|year=2012|
|year=2013|
+---------+

{code}

This is a bit hacky, though; there should be a better solution for your problem. 
Also, this is on Spark 2.0. Try whether 1.6 can take this. 
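
Putting the workaround together, a rough end-to-end sketch for Spark 1.6's HiveContext (table and temp-table names like `tmp.ptest3` and `t000` follow the example above; adjust them to your setup):

```scala
// Create a Hive-compatible partitioned table up front, instead of letting
// saveAsTable write the Spark-SQL-specific (non-Hive-readable) layout.
hiveContext.sql(
  """CREATE TABLE IF NOT EXISTS tmp.ptest3 (val STRING)
    |PARTITIONED BY (year INT)
    |STORED AS PARQUET""".stripMargin)

// Route each partition's rows through a temp table so the metastore
// records the partition and Hive can see it.
val df2012 = hiveContext.sql("SELECT val FROM t000 WHERE year = 2012")
df2012.registerTempTable("df2012")
hiveContext.sql(
  "INSERT INTO TABLE tmp.ptest3 PARTITION (year=2012) SELECT val FROM df2012")

// Hive and Spark should now both list the partition.
hiveContext.sql("SHOW PARTITIONS tmp.ptest3").show()
```

The key design point is that the partitions are created through Hive DDL/DML, so they are registered in the metastore, at the cost of one insert statement per partition value.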


