Yuzhou Sun created SPARK-37027:
----------------------------------

             Summary: Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES
                 Key: SPARK-37027
                 URL: https://issues.apache.org/jira/browse/SPARK-37027
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.2, 2.4.5
            Reporter: Yuzhou Sun

If a Hive table is created with both {{WITH SERDEPROPERTIES ('path'='<tableLocation>')}} and {{LOCATION <tableLocation>}}, Spark can return duplicated rows when reading the table. This issue seems to be an extension of SPARK-30507.

Steps to reproduce:
# Create the table and insert records via Hive (Spark does not allow inserting into a table like this):
{code:sql}
CREATE TABLE `test_table`(
  `c1` BIGINT,
  `c2` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('path'='<tableLocationPath>')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocationPath>';

INSERT INTO TABLE `test_table` VALUES (0, '0');

SELECT * FROM `test_table`;
-- returns
-- 0 0
{code}
# Read the table above from Spark:
{code:sql}
SELECT * FROM `test_table`;
-- returns
-- 0 0
-- 0 0
{code}

However, if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark returns the same result as Hive (i.e. a single row).

A similar case: if a Hive table is created with both {{WITH SERDEPROPERTIES ('path'='<anotherPath>')}} and {{LOCATION <tableLocation>}}, Spark reads the rows under both {{anotherPath}} and {{tableLocation}}, regardless of the value of {{spark.sql.hive.convertMetastoreParquet}}. Hive, however, seems to return only the rows under {{tableLocation}}.

Another similar case: if {{path}} is provided in {{TBLPROPERTIES}}, Spark does not double the rows when {{'path'='<tableLocation>'}}.
If {{'path'='<anotherPath>'}}, Spark reads the rows under both {{anotherPath}} and {{tableLocation}}, while Hive seems to ignore the {{path}} in {{TBLPROPERTIES}} in both cases.

Code examples for the above cases (a diff patch written against {{HiveParquetMetastoreSuite.scala}}) can be found in Attachments.
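
For reference, the {{spark.sql.hive.convertMetastoreParquet=false}} workaround described for the first case can be tried per session. This is only a minimal sketch, assuming a Spark SQL session with Hive support and the {{test_table}} from the reproduce steps. The comments restate what this report observed and are not guaranteed output.
{code:sql}
-- Inspect the table metadata; the output is expected to include the table
-- location and the storage (serde) properties, where 'path' should appear.
DESCRIBE FORMATTED `test_table`;

-- Read through the Hive SerDe instead of Spark's built-in Parquet reader.
SET spark.sql.hive.convertMetastoreParquet=false;

SELECT * FROM `test_table`;
-- per this report, Spark should now match Hive and return a single row
-- when 'path' in SERDEPROPERTIES equals the table location
{code}
According to the cases above, this setting does not help when {{path}} points to a location other than the table location: Spark is reported to read both directories regardless of its value.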