[ https://issues.apache.org/jira/browse/SPARK-37027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuzhou Sun updated SPARK-37027:
-------------------------------
    Attachment: SPARK-37027-test-example.patch

> Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-37027
>                 URL: https://issues.apache.org/jira/browse/SPARK-37027
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5, 3.1.2
>            Reporter: Yuzhou Sun
>            Priority: Trivial
>         Attachments: SPARK-37027-test-example.patch
>
>
> If a Hive table is created with both {{WITH SERDEPROPERTIES ('path'='<tableLocation>')}} and {{LOCATION <tableLocation>}}, Spark can return doubled rows when reading the table. This issue seems to be an extension of SPARK-30507.
> Steps to reproduce:
> # Create the table and insert records via Hive (Spark does not allow inserting into a table like this):
> {code:sql}
> CREATE TABLE `test_table`(
>   `c1` LONG,
>   `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES ('path'='<tableLocationPath>')
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '<tableLocationPath>';
> INSERT INTO TABLE `test_table`
> VALUES (0, '0');
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> {code}
> # Read the above table from Spark:
> {code:sql}
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> -- 0 0
> {code}
> But if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark returns the same result as Hive (i.e. a single row).
> A similar case: if a Hive table is created with both {{WITH SERDEPROPERTIES ('path'='<anotherPath>')}} and {{LOCATION <tableLocation>}}, Spark reads both the rows under {{anotherPath}} and the rows under {{tableLocation}}, regardless of the value of {{spark.sql.hive.convertMetastoreParquet}}.
> However, Hive seems to return only the rows under {{tableLocation}}.
> Another similar case: if {{path}} is provided in {{TBLPROPERTIES}}, Spark does not double the rows when {{'path'='<tableLocation>'}}. If {{'path'='<anotherPath>'}}, Spark reads both the rows under {{anotherPath}} and the rows under {{tableLocation}}, while Hive seems to keep ignoring the {{path}} in {{TBLPROPERTIES}}.
> Code examples for the above cases (a diff patch written against {{HiveParquetMetastoreSuite.scala}}) can be found in the Attachments.
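> For the first case (the same path in both {{SERDEPROPERTIES}} and {{LOCATION}}), the observation about {{spark.sql.hive.convertMetastoreParquet}} can be sketched as a session-level setting applied before the read. This is only an illustration based on the behavior reported in this issue, assuming the {{test_table}} from the reproduction steps; it has not been checked against versions other than those listed here:
> {code:sql}
> -- Disable the metastore Parquet conversion for this session
> SET spark.sql.hive.convertMetastoreParquet=false;
> SELECT * FROM `test_table`;
> -- per the report above, expected to match Hive:
> -- 0 0
> {code}
> As noted above, this setting does not seem to help when {{path}} points to a different directory than {{LOCATION}}.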