Hi,

Below is the CREATE statement for my Hudi dataset.

  CREATE EXTERNAL TABLE `inventory`.`customer`(
    `_hoodie_commit_time` string,
    `_hoodie_commit_seqno` string,
    `_hoodie_record_key` string,
    `_hoodie_partition_path` string,
    `_hoodie_file_name` string,
    `id` bigint,
    `sales` bigint,
    `merchant` bigint,
    `item_status` bigint,
    `tem_shipment` bigint)
  PARTITIONED BY (`dt` string)
  ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  WITH SERDEPROPERTIES ('serialization.format' = '1')
  STORED AS
    INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
  LOCATION 's3://<warehouse-bucket>/<path>/inventory/customer'
  TBLPROPERTIES (
    'bucketing_version' = '2',
    'transient_lastDdlTime' = '1572952974',
    'last_commit_time_sync' = '20191114192136')

I've taken care of adding *hudi-hive-bundle-0.5.1-SNAPSHOT.jar* to Hive,
*hudi-presto-bundle-0.5.1-SNAPSHOT.jar* to Presto, and
*hudi-spark-bundle-0.5.1-SNAPSHOT.jar* to Spark (all three share a common
Metastore).
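
For reference, this is roughly how the bundles are wired in (the paths are
placeholders for wherever the jars actually live). In Hive, per session:

  ADD JAR /path/to/hudi-hive-bundle-0.5.1-SNAPSHOT.jar;

For Presto, the bundle was placed in the Hive connector's plugin directory
on each node; the Spark invocation is shown further below.
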
We are running Hudi in COW mode, and we noticed that multiple versions of
the .parquet files are written per partition, depending on the number of
updates arriving for them over each batch execution. When queried from
Hive or Presto for any primary key with multiple updates, a single record
with the latest state is returned (I assume *HoodieParquetInputFormat*
does the magic of filtering out the older versions). However, when I
execute the same query in Spark SQL, I get duplicate records for any
primary key with multiple updates.
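
For concreteness, the queries are of this shape (the id value below is
made up; any key that has received more than one update shows the
behaviour):

  SELECT `id`, `sales`, `dt`, `_hoodie_commit_time`
  FROM `inventory`.`customer`
  WHERE `id` = 12345;

Hive and Presto return a single row for such a key, carrying the latest
_hoodie_commit_time, while Spark SQL returns one row per commit that
touched it.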

Can someone help me understand why Spark is not able to deduplicate
records across multiple commits the way Presto and Hive do? I've taken
care of providing hudi-spark-bundle-0.5.1-SNAPSHOT.jar while starting
spark-shell. Is there something I'm missing?
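
For completeness, this is how the shell is started and the query run (the
jar path is again a placeholder, and the id is the same made-up example as
above):

  spark-shell --jars /path/to/hudi-spark-bundle-0.5.1-SNAPSHOT.jar

  scala> spark.sql("SELECT * FROM inventory.customer WHERE id = 12345").show()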

Thanks in advance.

Regards,
Purushotham Pushpavanth
