[ https://issues.apache.org/jira/browse/HUDI-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-1196:
---------------------------------
    Fix Version/s: 0.7.0

> Record being placed in incorrect partition during upsert on COW/MOR global indexed tables
> -----------------------------------------------------------------------------------------
>
>                 Key: HUDI-1196
>                 URL: https://issues.apache.org/jira/browse/HUDI-1196
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ryan Pifer
>            Assignee: Ryan Pifer
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>
> When upserting a record into a global index table (global Bloom and HBase) where a
> single batch contains multiple versions of the record in different partitions, the
> record is deduplicated correctly but placed in the incorrect partition. This
> also occurred with "hoodie.bloom.update.partition.path=true" set.
>
> Batch with multiple versions of a record in different partitions:
> ```
> scala> val inputDF = spark.read.format("parquet").load(inputDataPath)
> scala> inputDF.show()
> +--------+---------+----------------+-------------+-------------+
> |     wbn|    cs_ss|     action_date|           ad|   ad_updated|
> +--------+---------+----------------+-------------+-------------+
> |12345678|InTransit|1596716921000601|2020-08-06-12|2020-08-06-12|
> |12345678|  Pending|1596716921000602|2020-08-06-12|2020-08-06-12|
> |12345678|  Pending|1596716921000603|2020-08-06-13|2020-08-06-13|
> +--------+---------+----------------+-------------+-------------+
> ```
>
> Values when querying the _ro and _rt tables:
> ```
> scala> spark.sql("select * from gb_update_partition_1_ro").show()
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|  cs_ss|     action_date|   ad_updated|           ad|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |     20200817220935|  20200817220935_0_1|          12345678|         2020-08-06-12|4dddb6e8-87c4-4bd...|12345678|Pending|1596716921000603|2020-08-06-13|2020-08-06-12|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
>
> scala> spark.sql("select * from gb_update_partition_1_rt").show()
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|  cs_ss|     action_date|   ad_updated|           ad|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |     20200817221924|  20200817221924_0_1|          12345678|         2020-08-06-12|4dddb6e8-87c4-4bd...|12345678|Pending|1596716921000603|2020-08-06-13|2020-08-06-12|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> ```
>
> The record shows the most recent version of the data, except that the
> partition values come from the older versions.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
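For illustration, the placement the report expects can be sketched in plain Scala, without Spark or Hudi. This is a minimal model under stated assumptions, not Hudi's implementation: it assumes `action_date` is the ordering field used to pick the winning version and that the partition path is derived from `ad`. The names `Version`, `DedupSketch`, `latestPerKey`, and `expectedPartition` are hypothetical.

```scala
// Hypothetical model of one incoming row; field names follow the batch in the report.
final case class Version(
  wbn: String,        // record key
  csSs: String,       // cs_ss
  actionDate: Long,   // action_date, assumed to be the ordering field
  ad: String,         // assumed to be the partition field
  adUpdated: String   // ad_updated
)

object DedupSketch {
  // Deduplicate a batch: keep, per record key, the version with the largest action_date.
  def latestPerKey(batch: Seq[Version]): Map[String, Version] =
    batch.groupBy(_.wbn).map { case (key, versions) =>
      key -> versions.maxBy(_.actionDate)
    }

  // Expected placement: the partition comes from the same version that won
  // deduplication, not from an older version in the batch.
  def expectedPartition(batch: Seq[Version]): Map[String, String] =
    latestPerKey(batch).map { case (key, winner) => key -> winner.ad }

  // The three versions from the report's input batch.
  val batch: Seq[Version] = Seq(
    Version("12345678", "InTransit", 1596716921000601L, "2020-08-06-12", "2020-08-06-12"),
    Version("12345678", "Pending",   1596716921000602L, "2020-08-06-12", "2020-08-06-12"),
    Version("12345678", "Pending",   1596716921000603L, "2020-08-06-13", "2020-08-06-13")
  )
}
```

Under these assumptions, `DedupSketch.expectedPartition(DedupSketch.batch)` maps key `12345678` to `2020-08-06-13`, whereas the queries in the report show `_hoodie_partition_path` as `2020-08-06-12`.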