[ https://issues.apache.org/jira/browse/HUDI-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan updated HUDI-1347:
--------------------------------------
    Status: In Progress  (was: Open)

> Hbase index partition changes cause data duplication problems
> -------------------------------------------------------------
>
>                 Key: HUDI-1347
>                 URL: https://issues.apache.org/jira/browse/HUDI-1347
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Index
>            Reporter: jing
>            Assignee: jing
>            Priority: Major
>              Labels: pull-request-available, sev:critical, user-support-issues
>
> 1. A record repeatedly changes its partition. After deduplication, the
> partition in the HoodieRecord key no longer matches the partition in the
> record data.
> E.g.:
> id,oid,name,dt,isdeleted,lastupdatedttm,rowkey
> 9,1,aaaa,2018,0,2020-02-17 00:50:25.000001,00_test1-9-1
> 9,1,aaaa,2019,0,2020-02-17 00:50:25.000002,00_test1-9-1
> rowkey is the primary key and dt is the partition. After deduplication, the
> key of the HoodieRecord object is (00_test1-9-1,2018). The key should be
> (00_test1-9-1,2019).
> 2. An exception in the Hudi task can leave the HBase index written
> successfully while the task itself fails. If the task is retried, the
> record whose partition changed is treated as a plain insert, and the data
> written before the partition change is not deleted.
> Solution:
> 1. Fix the incorrect partition in the HoodieRecord key caused by the
> deduplication operation.
> 2. Add a rollback operation to the HBase index instead of doing nothing. A
> partition change needs to be rolled back to the index state of the last
> successful commit.
> 3. Add more test cases.
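
The two fixes proposed in the description can be illustrated with a minimal sketch. This is not Hudi code: `Record`, `deduplicate`, and `rollback_index` are hypothetical names, a plain dict stands in for the HBase index, and `lastupdatedttm` is assumed to be the ordering field. It only shows the intended behaviour: deduplication keeps the whole winning record, so the key partition and the data partition agree, and a failed commit restores the index to its previous mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical stand-in for a HoodieRecord: the key is (rowkey, partition).
    rowkey: str
    partition: str
    ordering: str  # lastupdatedttm; same fixed format, so string comparison works
    data: dict


def deduplicate(records):
    """Keep one record per rowkey: the one with the largest ordering value.

    The whole winning record is kept, so the partition in its key always
    matches the partition in its data.
    """
    latest = {}
    for rec in records:
        current = latest.get(rec.rowkey)
        if current is None or rec.ordering > current.ordering:
            latest[rec.rowkey] = rec
    return list(latest.values())


def rollback_index(index, before_images):
    """Restore index entries to their state at the last successful commit.

    `before_images` maps rowkey -> previous partition, or None when the key
    did not exist before the failed commit.
    """
    for rowkey, old_partition in before_images.items():
        if old_partition is None:
            index.pop(rowkey, None)
        else:
            index[rowkey] = old_partition


# The two rows from the issue description.
rows = [
    Record("00_test1-9-1", "2018", "2020-02-17 00:50:25.000001", {"dt": "2018"}),
    Record("00_test1-9-1", "2019", "2020-02-17 00:50:25.000002", {"dt": "2019"}),
]
deduped = deduplicate(rows)

# Simulate a commit that updates the index and then fails.
index = {"00_test1-9-1": "2018"}
before = {"00_test1-9-1": index.get("00_test1-9-1")}
index["00_test1-9-1"] = "2019"   # index write succeeds
rollback_index(index, before)    # task fails, so the index is restored
```

With the index restored, a retry would again see the record as located in partition 2018 and could handle the move to 2019 as a partition change rather than as a new insert.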