Hi, are you setting the path filters when you query the Hudi Hive table via Spark? See http://hudi.apache.org/querying_data.html#spark-ro-view (or http://hudi.apache.org/querying_data.html#spark-rt-view for the real-time view).
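Concretely, the setting the linked docs describe looks like the sketch below — run it in spark-shell before querying the table (this assumes Spark 2.x with the hudi-spark-bundle jar on the classpath; `HoodieROTablePathFilter` is the filter class from the read-optimized-view docs):

```scala
// Register Hudi's path filter on the Hadoop configuration so Spark's
// Parquet reader picks only the latest file slice per file group,
// instead of reading every .parquet version (which yields duplicates).
spark.sparkContext.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class",
  classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
  classOf[org.apache.hadoop.fs.PathFilter])

// After this, querying the Hive-registered table from Spark SQL, e.g.
//   spark.sql("SELECT * FROM inventory.customer WHERE dt = '2019-11-14'")
// should return one row per record key, matching Hive/Presto behavior.
```

Hive and Presto do not need this step because `HoodieParquetInputFormat` (declared in the table DDL) already performs the same filtering; Spark's native Parquet reader bypasses the input format, which is why the path filter has to be set explicitly.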
- Vinoth

On Fri, Nov 15, 2019 at 5:03 AM Purushotham Pushpavanthar <pushpavant...@gmail.com> wrote:
> Hi,
>
> Below is the create statement for my Hudi dataset:
>
> CREATE EXTERNAL TABLE `inventory`.`customer`(
>   `_hoodie_commit_time` string,
>   `_hoodie_commit_seqno` string,
>   `_hoodie_record_key` string,
>   `_hoodie_partition_path` string,
>   `_hoodie_file_name` string,
>   `id` bigint,
>   `sales` bigint,
>   `merchant` bigint,
>   `item_status` bigint,
>   `tem_shipment` bigint)
> PARTITIONED BY (`dt` string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES ('serialization.format' = '1')
> STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3://<warehouse-bucket>/<path>/inventory/customer'
> TBLPROPERTIES (
>   'bucketing_version' = '2',
>   'transient_lastDdlTime' = '1572952974',
>   'last_commit_time_sync' = '20191114192136')
>
> I've taken care of adding hudi-hive-bundle-0.5.1-SNAPSHOT.jar in Hive,
> hudi-presto-bundle-0.5.1-SNAPSHOT.jar in Presto, and
> hudi-spark-bundle-0.5.1-SNAPSHOT.jar in Spark (all three share a common
> Metastore).
> We are running Hudi in COW mode, and we noticed that multiple versions of
> the .parquet files are written per partition, depending on the number of
> updates coming to them over each batch execution. When queried from Hive
> or Presto, any primary key with multiple updates returns a single record
> with the latest state (I assume HoodieParquetInputFormat does the magic of
> taking care of duplicates). Whereas when I execute the same query in Spark
> SQL, I get duplicate records for any primary key with multiple updates.
>
> Can someone help me understand why Spark is not able to handle
> deduplication of records across multiple commits, which Presto and Hive
> are able to do?
> I've taken care of providing hudi-spark-bundle-0.5.1-SNAPSHOT.jar while
> starting spark-shell. Is there something that I'm missing?
>
> Thanks in advance.
>
> Regards,
> Purushotham Pushpavanth