Anoop Johnson created SPARK-31799:
-------------------------------------

             Summary: Spark Datasource Tables Creating Incorrect Hive Metadata
                 Key: SPARK-31799
                 URL: https://issues.apache.org/jira/browse/SPARK-31799
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.5
            Reporter: Anoop Johnson
If you create a CSV or JSON table using Spark SQL, it writes the wrong Hive metadata, breaking compatibility with other query engines like Hive and Presto. Here is a very simple example:

{code:sql}
CREATE TABLE test_csv (id String, name String)
USING csv
LOCATION 's3://[...]';
{code}

If you describe the table using Presto, you will see:

{code:sql}
CREATE EXTERNAL TABLE `test_csv`(
  `col` array<string> COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'path'='s3://...')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  's3://.../test_csv-__PLACEHOLDER__'
TBLPROPERTIES (
  'spark.sql.create.version'='2.4.4',
  'spark.sql.sources.provider'='csv',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"string","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}}]}',
  'transient_lastDdlTime'='1590196086');
{code}

The table location is set to a placeholder value, and the schema is always recorded as {{`col` array<string>}}. The serde/input format is also wrong: it says sequence file even though the requested format is CSV. All the correct metadata is written to custom table properties prefixed with {{spark.sql}}, but Hive and Presto do not understand these table properties, so the table is broken for them.

I could reproduce this with JSON too, but not with Parquet. I root-caused the issue to CSV and JSON tables not being handled [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66] in HiveSerDe.scala. Is there a reason why they are not handled? I could send a patch to fix this, but the caveat is that the CSV and JSON Hive serdes would need to be on the Spark classpath; otherwise, table creation will fail.
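For concreteness, here is a minimal sketch of what the missing {{serdeMap}} entries could look like, modeled on the existing entries in HiveSerDe.scala. The serde class names are my suggestion, not a confirmed design: {{OpenCSVSerde}} ships with hive-serde, while {{JsonSerDe}} lives in hive-hcatalog-core, which is not on the default Spark classpath (hence the caveat above).

{code:scala}
// Self-contained sketch; HiveSerDe mirrors the case class in
// org.apache.spark.sql.internal.HiveSerDe.
case class HiveSerDe(
    inputFormat: Option[String] = None,
    outputFormat: Option[String] = None,
    serde: Option[String] = None)

// Candidate entries to add to HiveSerDe.serdeMap. The serde classes
// chosen here are suggestions, not the final patch.
val csvJsonSerdes = Map(
  "csv" -> HiveSerDe(
    inputFormat = Option("org.apache.hadoop.mapred.TextInputFormat"),
    outputFormat = Option("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"),
    // Stock Hive CSV serde (hive-serde); it surfaces every column as string.
    serde = Option("org.apache.hadoop.hive.serde2.OpenCSVSerde")),
  "json" -> HiveSerDe(
    inputFormat = Option("org.apache.hadoop.mapred.TextInputFormat"),
    outputFormat = Option("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"),
    // Stock Hive JSON serde; lives in hive-hcatalog-core, which is not on
    // Spark's default classpath -- hence the caveat above.
    serde = Option("org.apache.hive.hcatalog.data.JsonSerDe")))
{code}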
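In the meantime, a workaround that appears to produce Hive-compatible metadata (a sketch, assuming Hive support is enabled; the table name and location are placeholders from the example above) is to create the table with Hive DDL syntax instead of {{USING csv}}:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of a workaround: use Hive-format DDL so Spark records a real
// serde and input/output format instead of the placeholder metadata it
// writes for `USING csv` datasource tables.
val spark = SparkSession.builder()
  .enableHiveSupport() // Hive-format CREATE TABLE requires Hive support
  .getOrCreate()

spark.sql("""
  CREATE EXTERNAL TABLE test_csv (id STRING, name STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  STORED AS TEXTFILE
  LOCATION 's3://[...]'
""")
{code}

The trade-off is that {{OpenCSVSerde}} exposes all columns as strings, which happens to match this example's schema.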