[ https://issues.apache.org/jira/browse/SPARK-31799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124158#comment-17124158 ]
Anoop Johnson edited comment on SPARK-31799 at 6/2/20, 6:09 PM:
----------------------------------------------------------------
Sorry to reopen, but could you please provide some context on why Spark is not able to store the table in a Hive-compatible way? The impact of not being able to read the data from other engines is major. For CSV tables, I understand that Spark may have custom parsing options that may not be Hive-compatible. But for self-describing formats like JSON, shouldn't we persist the table metadata in a Hive-compatible way? Also, irrespective of the file format, shouldn't we create the right metadata for the table location and schema?

> Spark Datasource Tables Creating Incorrect Hive Metadata
> --------------------------------------------------------
>
>                 Key: SPARK-31799
>                 URL: https://issues.apache.org/jira/browse/SPARK-31799
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Anoop Johnson
>            Priority: Major
>
> I found that if I create a CSV or JSON table using Spark SQL, it writes the wrong Hive table metadata, breaking compatibility with other query engines like Hive and Presto.
> Here is a very simple example:
> {code:sql}
> CREATE TABLE test_csv (id STRING, name STRING)
> USING csv
> LOCATION 's3://[...]'
> ;
> {code}
> If you describe the table using Presto, you will see:
> {code:sql}
> CREATE EXTERNAL TABLE `test_csv`(
>   `col` array<string> COMMENT 'from deserializer')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES (
>   'path'='s3://[...]')
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.SequenceFileInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
> LOCATION
>   's3://[...]/test_csv-__PLACEHOLDER__'
> TBLPROPERTIES (
>   'spark.sql.create.version'='2.4.4',
>   'spark.sql.sources.provider'='csv',
>   'spark.sql.sources.schema.numParts'='1',
>   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
>   'transient_lastDdlTime'='1590196086')
> ;
> {code}
> The table location is set to a placeholder value, and the schema is always set to _col array<string>_. The serde/input format is also wrong: it says _SequenceFileInputFormat_ and _LazySimpleSerDe_ even though the requested format is CSV.
> All the right metadata is written to the custom table properties with the prefix _spark.sql_. However, Hive and Presto do not understand these table properties, so they cannot read the table. I could reproduce this with JSON too, but not with Parquet.
> I root-caused this issue to CSV and JSON tables not being handled [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66] in HiveSerDe.scala. As a result, these default values are written.
> Is there a reason why CSV and JSON are not handled?
> I could send a patch to fix this, but the caveat is that the CSV and JSON Hive serdes would need to be on the Spark classpath; otherwise the table creation will fail.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
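A possible workaround until this is fixed, assuming the Spark session is built with Hive support: Hive-style DDL (rather than the `USING csv` datasource syntax) goes through the Hive serde resolution path, so the metastore entry records a real serde and input format that Hive and Presto can read. A minimal sketch, with the S3 location elided as in the report; the table name `test_csv_hive` is just an illustration:

{code:sql}
-- Sketch of a workaround: Hive-style DDL instead of "USING csv".
-- This records LazySimpleSerDe with TextInputFormat in the metastore,
-- which other engines understand.
CREATE EXTERNAL TABLE test_csv_hive (id STRING, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://[...]';
{code}

One trade-off to note: LazySimpleSerDe's DELIMITED syntax does not handle quoted fields, so this is only equivalent to Spark's CSV reader for simple, unquoted data.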
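For comparison, if HiveSerDe.scala did map the csv provider to a Hive serde, the Presto-visible DDL would presumably look something like the following. The choice of OpenCSVSerde here is an assumption about what such a patch might use, not the actual fix; per the caveat above, that serde would have to be on the Spark classpath:

{code:sql}
-- Hypothetical Hive-compatible metadata for the csv provider
-- (OpenCSVSerde mapping is an assumption, not the actual patch).
CREATE EXTERNAL TABLE `test_csv`(
  `id` string,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://[...]';
{code}

The key differences from the broken output above: the real column schema, a text (not sequence-file) input format, and the actual table location instead of the `__PLACEHOLDER__` path.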