[ https://issues.apache.org/jira/browse/SPARK-31799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124158#comment-17124158 ]
Anoop Johnson edited comment on SPARK-31799 at 6/2/20, 6:09 PM:
----------------------------------------------------------------
Sorry to reopen, but could you please provide some context on why Spark is not able to store the table in a Hive-compatible way? The impact of not being able to read the data from other engines is major. For CSV tables, I understand that Spark may have custom parsing options that may not be Hive-compatible. But for self-describing formats like JSON, shouldn't we persist the table metadata in a Hive-compatible way? Also, irrespective of the file format, shouldn't we create the right metadata for the table location and schema?

> Spark Datasource Tables Creating Incorrect Hive Metadata
> --------------------------------------------------------
>
>                 Key: SPARK-31799
>                 URL: https://issues.apache.org/jira/browse/SPARK-31799
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Anoop Johnson
>            Priority: Major
>
> I found that if I create a CSV or JSON table using Spark SQL, it writes the wrong Hive table metadata, breaking compatibility with other query engines like Hive and Presto.
> Here is a very simple example:
> {code:sql}
> CREATE TABLE test_csv (id STRING, name STRING)
> USING csv
> LOCATION 's3://[...]'
> ;
> {code}
> If you describe the table using Presto, you will see:
> {code:sql}
> CREATE EXTERNAL TABLE `test_csv`(
>   `col` array<string> COMMENT 'from deserializer')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES (
>   'path'='s3://[...]')
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.SequenceFileInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
> LOCATION
>   's3://[...]/test_csv-__PLACEHOLDER__'
> TBLPROPERTIES (
>   'spark.sql.create.version'='2.4.4',
>   'spark.sql.sources.provider'='csv',
>   'spark.sql.sources.schema.numParts'='1',
>   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
>   'transient_lastDdlTime'='1590196086')
> ;
> {code}
> The table location is set to a placeholder value, and the schema is always set to _col array<string>_. The serde/input format is also wrong: it says _SequenceFileInputFormat_ and _LazySimpleSerDe_ even though the requested format is CSV.
> All the right metadata is written to the custom table properties with the prefix _spark.sql_. However, Hive and Presto do not understand these table properties, so they cannot read the table. I could reproduce this with JSON too, but not with Parquet.
> I root-caused this issue to CSV and JSON tables not being handled [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66] in HiveSerDe.scala. As a result, these default values are written.
> Is there a reason why CSV and JSON are not handled?
> I could send a patch to fix this, but the caveat is that the CSV and JSON Hive serdes would need to be on the Spark classpath; otherwise the table creation will fail.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
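A possible workaround until this is fixed, assuming the Spark session is built with Hive support: Hive-style DDL (rather than the `USING csv` datasource syntax) goes through the Hive serde resolution path, so the metastore entry records a real serde and input format that Hive and Presto can read. A minimal sketch, with the S3 location elided as in the report; the table name `test_csv_hive` is just an illustration:

{code:sql}
-- Sketch of a workaround: Hive-style DDL instead of "USING csv".
-- This records LazySimpleSerDe with TextInputFormat in the metastore,
-- which other engines understand.
CREATE EXTERNAL TABLE test_csv_hive (id STRING, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://[...]';
{code}

One trade-off to note: LazySimpleSerDe's DELIMITED syntax does not handle quoted fields, so this is only equivalent to Spark's CSV reader for simple, unquoted data.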
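For comparison, if HiveSerDe.scala did map the csv provider to a Hive serde, the Presto-visible DDL would presumably look something like the following. The choice of OpenCSVSerde here is an assumption about what such a patch might use, not the actual fix; per the caveat above, that serde would have to be on the Spark classpath:

{code:sql}
-- Hypothetical Hive-compatible metadata for the csv provider
-- (OpenCSVSerde mapping is an assumption, not the actual patch).
CREATE EXTERNAL TABLE `test_csv`(
  `id` string,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://[...]';
{code}

The key differences from the broken output above: the real column schema, a text (not sequence-file) input format, and the actual table location instead of the `__PLACEHOLDER__` path.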