[ https://issues.apache.org/jira/browse/HUDI-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gary Li updated HUDI-1415:
--------------------------
    Affects Version/s: 0.9.0

Read Hoodie Table As Spark DataSource Table
-------------------------------------------

                Key: HUDI-1415
                URL: https://issues.apache.org/jira/browse/HUDI-1415
            Project: Apache Hudi
         Issue Type: Bug
         Components: Spark Integration
   Affects Versions: 0.9.0
           Reporter: pengzhiwei
           Assignee: pengzhiwei
           Priority: Major
             Labels: pull-request-available, user-support-issues
            Fix For: 0.8.0

If we update a Hudi table two or more times, we will get an incorrect query count from Spark SQL.

Currently Hudi can sync the metadata to the Hive metastore using HiveSyncTool. The table description synced to Hive looks like this:

{code:java}
CREATE EXTERNAL TABLE `tbl_price_insert0`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` int,
  `name` string,
  `price` double,
  `version` int,
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi/tbl_price_insert0'
TBLPROPERTIES (
  'last_commit_time_sync'='20201124105009',
  'transient_lastDdlTime'='1606186222')
{code}

When we query this table using Spark SQL, it treats the table as a Hive table, not a Spark data source table, and converts it to a Parquet LogicalRelation in HiveStrategies#RelationConversions. As a result, Spark SQL reads the Hudi table just like a plain Parquet data source. This leads to an incorrect query result.
In order to query a Hudi table correctly in Spark SQL, more table properties and serde properties must be added to the Hive metastore, like the following:

{code:java}
CREATE EXTERNAL TABLE `tbl_price_cow0`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` int,
  `name` string,
  `price` double,
  `version` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='/tmp/hudi/tbl_price_cow0')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi/tbl_price_cow0'
TBLPROPERTIES (
  'last_commit_time_sync'='20201124120532',
  'spark.sql.sources.provider'='hudi',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}',
  'transient_lastDdlTime'='1606190729')
{code}

These are the missing table properties:

{code:java}
spark.sql.sources.provider = 'hudi'
spark.sql.sources.schema.numParts = 'xx'
spark.sql.sources.schema.part.{num} = 'xx'
spark.sql.sources.schema.numPartCols = 'xx'
spark.sql.sources.schema.partCol.{num} = 'xx'
{code}

and the serde property:

{code:java}
'path'='/path/to/hudi'
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
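The schema is split across numbered `spark.sql.sources.schema.part.{num}` properties because Hive limits the length of a single table-property value. A minimal sketch of how such a property map could be assembled, assuming the commonly cited 4000-character split threshold (the helper name `schema_table_properties` and the `dt` partition column are illustrative, not part of Hudi or Spark):

```python
import json

def schema_table_properties(schema_json, part_cols=None, threshold=4000):
    """Illustrative helper: build spark.sql.sources.* table properties
    from a Spark schema JSON string, splitting it into fixed-size parts."""
    props = {"spark.sql.sources.provider": "hudi"}
    # Split the schema JSON into chunks so each property value stays small.
    parts = [schema_json[i:i + threshold]
             for i in range(0, len(schema_json), threshold)]
    props["spark.sql.sources.schema.numParts"] = str(len(parts))
    for num, part in enumerate(parts):
        props["spark.sql.sources.schema.part.%d" % num] = part
    # Partition columns, if any, get their own numbered properties.
    if part_cols:
        props["spark.sql.sources.schema.numPartCols"] = str(len(part_cols))
        for num, col in enumerate(part_cols):
            props["spark.sql.sources.schema.partCol.%d" % num] = col
    return props

# Schema matching the ticket's tbl_price_cow0 example.
schema = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": False, "metadata": {}},
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
        {"name": "price", "type": "double", "nullable": False, "metadata": {}},
        {"name": "version", "type": "integer", "nullable": False, "metadata": {}},
    ],
})
props = schema_table_properties(schema, part_cols=["dt"])
```

With these properties present, Spark can reassemble the schema by concatenating the parts in order and recognize the table as a `hudi` data source table instead of a plain Hive table.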