Hi,

Below is the CREATE statement for my Hudi dataset.
CREATE EXTERNAL TABLE `inventory`.`customer`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` bigint,
  `sales` bigint,
  `merchant` bigint,
  `item_status` bigint,
  `tem_shipment` bigint)
PARTITIONED BY (`dt` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://<warehouse-bucket>/<path>/inventory/customer'
TBLPROPERTIES (
  'bucketing_version' = '2',
  'transient_lastDdlTime' = '1572952974',
  'last_commit_time_sync' = '20191114192136')

I've taken care of adding *hudi-hive-bundle-0.5.1-SNAPSHOT.jar* in Hive, *hudi-presto-bundle-0.5.1-SNAPSHOT.jar* in Presto, and *hudi-spark-bundle-0.5.1-SNAPSHOT.jar* in Spark (all three share a common Metastore).

We are running Hudi in COW mode, and we noticed that multiple versions of the .parquet files are written per partition, depending on the number of updates arriving over each batch execution. When queried from Hive or Presto, any primary key with multiple updates returns a single record with the latest state (I assume *HoodieParquetInputFormat* does the magic of filtering out the older file versions). Whereas, when I execute the same query in Spark SQL, I get duplicate records for any primary key with multiple updates.

Can someone help me understand why Spark is not able to deduplicate records across multiple commits the way Presto and Hive do? I've taken care of providing hudi-spark-bundle-0.5.1-SNAPSHOT.jar while starting spark-shell. Is there something I'm missing?

Thanks in advance.

Regards,
Purushotham Pushpavanth
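For reference, this is roughly how I am querying from spark-shell (a minimal sketch; the partition value and id below are made-up examples, the table and columns match the DDL above):

```scala
// Launched as: spark-shell --jars hudi-spark-bundle-0.5.1-SNAPSHOT.jar
// Querying the Hive-synced table through Spark SQL; this is where I see
// one row per commit instead of only the latest state.
spark.sql(
  """SELECT id, sales, _hoodie_commit_time
    |FROM inventory.customer
    |WHERE dt = '2019-11-14'   -- example partition value
    |  AND id = 12345          -- example key with multiple updates
    |""".stripMargin
).show(false)
```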
