[ https://issues.apache.org/jira/browse/SPARK-31799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anoop Johnson updated SPARK-31799:
----------------------------------
    Description: 
If you create a CSV or JSON table using Spark SQL, it writes the wrong Hive metadata, breaking compatibility with other query engines like Hive and Presto. Here is a very simple example:
{code:sql}
CREATE TABLE test_csv (id String, name String)
USING csv
LOCATION 's3://[...]'
{code}
If you describe the table using Presto, you will see:
{code:sql}
CREATE EXTERNAL TABLE `test_csv`(
  `col` array<string> COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'path'='s3://...')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  's3://.../test_csv-__PLACEHOLDER__'
TBLPROPERTIES (
  'spark.sql.create.version'='2.4.4',
  'spark.sql.sources.provider'='csv',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
  'transient_lastDdlTime'='1590196086')
;
{code}
The table location is set to a placeholder value, and the schema is always set to {{`col` array<string>}}. The serde/input format is wrong - it says sequence file even though the requested format is CSV.

All the right metadata is written to the custom table properties with the prefix {{spark.sql}}. However, Hive and Presto do not understand these table properties, and this breaks them.

I could reproduce this with JSON too, but not with Parquet. I root-caused this issue to CSV and JSON tables not being handled [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66] in HiveSerDe.scala.

Is there a reason why they are not handled? I could send a patch to fix this, but the caveat is that the CSV and JSON Hive serdes need to be on the Spark classpath; otherwise, table creation will fail.

  was:
If you create a CSV or JSON table using Spark SQL, it writes the wrong Hive metadata, breaking compatibility with other query engines like Hive and Presto. Here is a very simple example:
{{CREATE TABLE test_csv (id String, name String)}}
{{USING csv}}
{{LOCATION 's3://[...]'}}
{{;}}
If you describe the table using Presto, you will see:
{code:sql}
CREATE EXTERNAL TABLE `test_csv`(
  `col` array<string> COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'path'='s3://...')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  's3://.../test_csv-__PLACEHOLDER__'
TBLPROPERTIES (
  'spark.sql.create.version'='2.4.4',
  'spark.sql.sources.provider'='csv',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
  'transient_lastDdlTime'='1590196086')
;
{code}
The table location is set to a placeholder value, and the schema is always set to {{`col` array<string>}}. The serde/input format is wrong - it says sequence file even though the requested format is CSV.

All the right metadata is written to the custom table properties with the prefix {{spark.sql}}. However, Hive and Presto do not understand these table properties, and this breaks them.

I could reproduce this with JSON too, but not with Parquet. I root-caused this issue to CSV and JSON tables not being handled [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66] in HiveSerDe.scala.

Is there a reason why they are not handled? I could send a patch to fix this, but the caveat is that the CSV and JSON Hive serdes need to be on the Spark classpath; otherwise, table creation will fail.
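For illustration, here is a minimal sketch of the kind of entries the linked {{serdeMap}} in HiveSerDe.scala would need for CSV and JSON. The serde class names (Hive's {{OpenCSVSerde}} and HCatalog's {{JsonSerDe}}) are assumptions for the sketch, not something this report confirms:

{code:scala}
// Hypothetical sketch only - not an actual Spark patch.
// The case class mirrors the shape of org.apache.spark.sql.internal.HiveSerDe.
case class HiveSerDe(
    inputFormat: Option[String] = None,
    outputFormat: Option[String] = None,
    serde: Option[String] = None)

// CSV and JSON are both plain text on disk, so they could reuse Hive's
// text input/output formats; only the serde class differs per format.
val csvAndJsonSerdes: Map[String, HiveSerDe] = Map(
  "csv" -> HiveSerDe(
    inputFormat = Option("org.apache.hadoop.mapred.TextInputFormat"),
    outputFormat = Option("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"),
    serde = Option("org.apache.hadoop.hive.serde2.OpenCSVSerde")),
  "json" -> HiveSerDe(
    inputFormat = Option("org.apache.hadoop.mapred.TextInputFormat"),
    outputFormat = Option("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"),
    serde = Option("org.apache.hive.hcatalog.data.JsonSerDe")))
{code}

This also illustrates the classpath caveat above: {{OpenCSVSerde}} ships in hive-serde, but {{JsonSerDe}} lives in hive-hcatalog-core, which is not on a stock Spark classpath, so DDL referencing it would fail at table-creation time unless that jar is added.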
> Spark Datasource Tables Creating Incorrect Hive Metadata
> --------------------------------------------------------
>
>                 Key: SPARK-31799
>                 URL: https://issues.apache.org/jira/browse/SPARK-31799
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Anoop Johnson
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org