[ https://issues.apache.org/jira/browse/SPARK-31799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anoop Johnson updated SPARK-31799:
----------------------------------
    Description: 
If you create a CSV or JSON table using Spark SQL, it writes the wrong Hive metadata, breaking compatibility with other query engines like Hive and Presto. Here is a very simple example:
{code:sql}
CREATE TABLE test_csv (id String, name String)
USING csv
LOCATION 's3://[...]'
{code}
If you describe the table using Presto, you will see:
{code:sql}
CREATE EXTERNAL TABLE `test_csv`(
  `col` array<string> COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'path'='s3://...')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  's3://.../test_csv-__PLACEHOLDER__'
TBLPROPERTIES (
  'spark.sql.create.version'='2.4.4',
  'spark.sql.sources.provider'='csv',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
  'transient_lastDdlTime'='1590196086')
;
{code}
The table location is set to a placeholder value, and the schema is always set to {{`col` array<string>}}. The serde/input format is wrong - it says sequence file even though the requested format is CSV.

All the right metadata is written to the custom table properties with the prefix {{spark.sql}}. However, Hive and Presto do not understand these table properties, and this breaks them.

I could reproduce this with JSON too, but not with Parquet. I root-caused this issue to CSV and JSON tables not being handled [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66] in HiveSerDe.scala.

Is there a reason why they are not handled? I could send a patch to fix this, but the caveat is that the CSV and JSON Hive serdes need to be on the Spark classpath; otherwise, table creation will fail.

  was:
If you create a CSV or JSON table using Spark SQL, it writes the wrong Hive metadata, breaking compatibility with other query engines like Hive and Presto. Here is a very simple example:
{{CREATE TABLE test_csv (id String, name String)}}
{{USING csv}}
{{LOCATION 's3://[...]'}}
{{;}}
If you describe the table using Presto, you will see:
{code:sql}
CREATE EXTERNAL TABLE `test_csv`(
  `col` array<string> COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'path'='s3://...')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  's3://.../test_csv-__PLACEHOLDER__'
TBLPROPERTIES (
  'spark.sql.create.version'='2.4.4',
  'spark.sql.sources.provider'='csv',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
  'transient_lastDdlTime'='1590196086')
;
{code}
The table location is set to a placeholder value, and the schema is always set to {{`col` array<string>}}. The serde/input format is wrong - it says sequence file even though the requested format is CSV.

All the right metadata is written to the custom table properties with the prefix {{spark.sql}}. However, Hive and Presto do not understand these table properties, and this breaks them.

I could reproduce this with JSON too, but not with Parquet. I root-caused this issue to CSV and JSON tables not being handled [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66] in HiveSerDe.scala.

Is there a reason why they are not handled? I could send a patch to fix this, but the caveat is that the CSV and JSON Hive serdes need to be on the Spark classpath; otherwise, table creation will fail.
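For illustration, here is a minimal sketch of the kind of entries the linked {{serdeMap}} in HiveSerDe.scala would need for CSV and JSON. The serde class names (Hive's {{OpenCSVSerde}} and HCatalog's {{JsonSerDe}}) are assumptions for the sketch, not something this report confirms:

{code:scala}
// Hypothetical sketch only - not an actual Spark patch.
// The case class mirrors the shape of org.apache.spark.sql.internal.HiveSerDe.
case class HiveSerDe(
    inputFormat: Option[String] = None,
    outputFormat: Option[String] = None,
    serde: Option[String] = None)

// CSV and JSON are both plain text on disk, so they could reuse Hive's
// text input/output formats; only the serde class differs per format.
val csvAndJsonSerdes: Map[String, HiveSerDe] = Map(
  "csv" -> HiveSerDe(
    inputFormat = Option("org.apache.hadoop.mapred.TextInputFormat"),
    outputFormat = Option("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"),
    serde = Option("org.apache.hadoop.hive.serde2.OpenCSVSerde")),
  "json" -> HiveSerDe(
    inputFormat = Option("org.apache.hadoop.mapred.TextInputFormat"),
    outputFormat = Option("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"),
    serde = Option("org.apache.hive.hcatalog.data.JsonSerDe")))
{code}

This also illustrates the classpath caveat above: {{OpenCSVSerde}} ships in hive-serde, but {{JsonSerDe}} lives in hive-hcatalog-core, which is not on a stock Spark classpath, so DDL referencing it would fail at table-creation time unless that jar is added.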
> Spark Datasource Tables Creating Incorrect Hive Metadata
> --------------------------------------------------------
>
>                 Key: SPARK-31799
>                 URL: https://issues.apache.org/jira/browse/SPARK-31799
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Anoop Johnson
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org