Anton,

Yes, you can achieve this with Hudi. Hudi uses a HoodieRecordPayload implementation to merge old and new records. You can define a source ordering field (here "sort_key") that governs which record is chosen as the latest one. DefaultHoodieRecordPayload supports this -> https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/DefaultHoodieRecordPayload.java
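
To make the merge rule concrete, here is a minimal sketch in plain Python (no Hudi dependency) of the behavior Anton describes: an incoming record replaces the stored one only when its "sort_key" is strictly greater. The function name merge_by_ordering_field and the record key field "id" are hypothetical, chosen for illustration. The option keys in hudi_options are the writer settings as I understand them; please verify them, and how the payload class treats equal ordering values, against the Hudi version you run.

```python
from typing import Any, Dict, Optional

# Assumed writer options for this use case; verify against your Hudi version.
# "id" is a hypothetical record key field used only for illustration.
hudi_options: Dict[str, str] = {
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "sort_key",
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.payload.ordering.field": "sort_key",
}

Record = Dict[str, Any]


def merge_by_ordering_field(
    stored: Optional[Record],
    incoming: Record,
    ordering_field: str = "sort_key",
) -> Record:
    """Return the record that should survive a merge.

    Illustrates the rule from the question: the incoming record wins only
    if its ordering value is strictly greater than the stored one. This is
    a sketch of the idea, not Hudi's actual implementation.
    """
    if stored is None:
        # Nothing stored yet for this key, so the incoming record is an insert.
        return incoming
    if incoming[ordering_field] > stored[ordering_field]:
        return incoming
    # Late-arriving (older or equal) change: keep what is already stored.
    return stored


# Change-log entries for one key, arriving out of order.
changes = [
    {"id": 1, "sort_key": 10, "value": "a"},
    {"id": 1, "sort_key": 30, "value": "c"},
    {"id": 1, "sort_key": 20, "value": "b"},  # arrives late, must not win
]

current: Optional[Record] = None
for change in changes:
    current = merge_by_ordering_field(current, change)
```

After the loop, current holds the record with sort_key 30, even though the sort_key 20 change arrived last.
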
You just need to set the correct source ordering field name. Take a look at an example here -> https://github.com/apache/hudi/blob/master/hudi-common/src/test/java/org/apache/hudi/common/model/TestDefaultHoodieRecordPayload.java#L44

Please create a GH issue or post in the general Slack channel for further collaboration if needed.

Thanks,
Nishith

On Sat, Jan 30, 2021 at 6:59 PM Anton Zuyeu <[email protected]> wrote:

> Hi Hudi team,
>
> We are replicating a database table by reading table change logs and applying
> them to a Hudi table. We would like to implement our pipeline so it can
> process records out of order. Pretty much, we want to introduce a column
> "sort_key" and only update existing records in the Hudi table if a new
> record's sort_key is greater than the sort_key value of the existing record.
> Initially we thought that we just needed to set the
> hoodie.datasource.write.precombine.field
> parameter to "sort_key"; however, it looks like that is not the case, as
> hoodie.datasource.write.precombine.field comes into play only when
> precombining records prior to writing. Is there a way to implement our use
> case using Hudi's primitives?
>
> Thank you,
> Anton
