[ https://issues.apache.org/jira/browse/SPARK-31799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120136#comment-17120136 ]
L. C. Hsieh commented on SPARK-31799:
-------------------------------------

This happens when Spark SQL decides it cannot save the data source table in a Hive-compatible way. In that case the data source table is readable only by Spark.

> Spark Datasource Tables Creating Incorrect Hive Metadata
> ---------------------------------------------------------
>
>                 Key: SPARK-31799
>                 URL: https://issues.apache.org/jira/browse/SPARK-31799
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Anoop Johnson
>            Priority: Major
>
> I found that if I create a CSV or JSON table using Spark SQL, it writes the wrong Hive table metadata, breaking compatibility with other query engines like Hive and Presto. Here is a very simple example:
> {code:sql}
> CREATE TABLE test_csv (id String, name String)
> USING csv
> LOCATION 's3://[...]'
> ;
> {code}
> If you describe the table using Presto, you will see:
> {code:sql}
> CREATE EXTERNAL TABLE `test_csv`(
>   `col` array<string> COMMENT 'from deserializer')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES (
>   'path'='s3://[...]')
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.SequenceFileInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
> LOCATION
>   's3://[...]/test_csv-__PLACEHOLDER__'
> TBLPROPERTIES (
>   'spark.sql.create.version'='2.4.4',
>   'spark.sql.sources.provider'='csv',
>   'spark.sql.sources.schema.numParts'='1',
>   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
>   'transient_lastDdlTime'='1590196086')
> ;
> {code}
> The table location is set to a placeholder value, and the schema is always set to _col array<string>_. The serde/inputformat is also wrong: it says _SequenceFileInputFormat_ and _LazySimpleSerDe_ even though the requested format is CSV.
> All of the correct metadata is written to the custom table properties with the _spark.sql_ prefix, but Hive and Presto do not understand these properties, so the table is broken for them. I could reproduce this with JSON too, but not with Parquet.
> I root-caused this issue to CSV and JSON tables not being handled [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66] in HiveSerDe.scala. As a result, these default values are written.
> Is there a reason why CSV and JSON are not handled? I could send a patch to fix this, but the caveat is that the CSV and JSON Hive serdes would need to be on the Spark classpath; otherwise table creation will fail.
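In case it helps anyone hitting this before a fix lands, one possible workaround is to create the table with Hive DDL syntax instead of `USING csv`, so Spark records a plain text-file table (LazySimpleSerDe over TextInputFormat) that Hive and Presto can read. This is only an untested sketch: the table name `test_csv_hive` and the comma delimiter are assumptions, and delimited text does not give you CSV quoting/escaping semantics.

{code:sql}
-- Hive-style DDL: Spark should record the real columns, the real location,
-- and a text serde/input format that Hive and Presto understand,
-- instead of the __PLACEHOLDER__ location and col array<string> schema above.
CREATE EXTERNAL TABLE test_csv_hive (id STRING, name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://[...]'
;
{code}

The `s3://[...]` placeholder is kept from the original report; point LOCATION at the real path. Describing the table from Presto afterwards should show the declared `id` and `name` columns rather than `col array<string>`.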