voonhous commented on PR #8579: URL: https://github.com/apache/hudi/pull/8579#issuecomment-1523206446

> > If preCombine is invoked with the same key when an old record {price: 11.00, _ts: 999} is received together with a new record {price: null, _ts: 1001}, the old record's column value might overwrite the existing newer record {price: 10.0, _ts: 1000}.
>
> This is expected, right? We always ignore the nulls while merging; shouldn't `#combineAndGetUpdateValue` follow the same convention?

`#combineAndGetUpdateValue` does follow the same convention. The crux of the gotcha here is that if a batch contains multiple records with the same key, the result differs from applying `#combineAndGetUpdateValue` to each record individually.

**NOTE:** My bad, the initial precombine-field value of the table's initial state was wrong. I've edited the previous examples. Let me provide an example again:

# preCombine + combineAndGetUpdateValue

```
Table initial state (1):
[1 a1_0 10.0 1000]

Table performs an update with an incoming batch that has the following records (2):
(preCombine + combineAndGetUpdateValue)
[
  [1 a1_0 11.0 999],
  [1 a1_0 null 1001]
]

After preCombine of the batch in (2), we get (3):
[1 a1_0 11.0 1001]

This is combineAndGetUpdateValue'd with (1) to produce:
> (1) + (3)
[1 a1_0 11.0 1001]
```

# combineAndGetUpdateValue only

```
Table initial state (1):
[1 a1_0 10.0 1000]

Table performs an update (2):
(combineAndGetUpdateValue)
[1 a1_0 11.0 999]

to produce (3) [NO CHANGE]:
[1 a1_0 10.0 1000]

Table performs an update again (4):
(combineAndGetUpdateValue)
[1 a1_0 null 1001]

End state of the table:
> (3) + (4)
[1 a1_0 10.0 1001]
```
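To make the divergence concrete, here is a minimal Python model of the ignore-nulls merge semantics described above. This is a hypothetical simplification, not Hudi's actual `PartialUpdateAvroPayload` implementation: `merge` stands in for both `preCombine` and `combineAndGetUpdateValue`, the record with the larger `_ts` wins, and a `null` field in the winner falls back to the other record's value.

```python
def merge(old, new):
    """Sketch of ignore-nulls merging: the record with the newer _ts wins,
    but any null field in the winner falls back to the other record's value.
    (Hypothetical simplification of Hudi's partial-update payload semantics.)"""
    winner, loser = (new, old) if new["_ts"] >= old["_ts"] else (old, new)
    return {k: (v if v is not None else loser.get(k)) for k, v in winner.items()}


base = {"key": 1, "name": "a1_0", "price": 10.0, "_ts": 1000}
u1 = {"key": 1, "name": "a1_0", "price": 11.0, "_ts": 999}
u2 = {"key": 1, "name": "a1_0", "price": None, "_ts": 1001}

# Path A: preCombine the batch first, then combine with the table state.
pre = merge(u1, u2)          # -> price 11.0, _ts 1001
path_a = merge(base, pre)    # -> price 11.0, _ts 1001

# Path B: combineAndGetUpdateValue for each record individually.
step = merge(base, u1)       # 999 < 1000, so no change
path_b = merge(step, u2)     # null price ignored -> price 10.0, _ts 1001

print(path_a["price"], path_b["price"])  # prints: 11.0 10.0
```

Both paths end at `_ts: 1001`, yet path A keeps `price: 11.0` while path B keeps `price: 10.0`, which is exactly the batch-vs-individual discrepancy described above.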
> > If preCombine is invoked with the same key when an old data {price: 11.00, _ts:999} is received together with a new data {price: null, _ts: 1001}, the old data's column value might overwrite the existing newer data {price: 10.0, _ts: 1000}. > > This is expected right ? We always ignore the nulls while merging, shouldn't the `#combineAndGetUpdateValue` follow the same convention? `#combineAndGetUpdate` does follow the same convention. The crux of the gotcha here is that if a batch contains multiple records of the same key, it will produce different results when `#combineAndGetUpdateValue` individually. **NOTE:** My bad, the initial precombineField value of the table's initial state is wrong. I've edited the previous examples. Let me provide an example again: # preCombine + combineAndGetUpdateValue ``` Table initial state (1): [1 a1_0 10.0 1000] Table performs an update with an incoming batch that has the following results (2): (preCombine + combineAndGetUpdateValue) [ [1 a1_0 11.0 999], [1 a1_0 null 1001] ] After preCombine results from (2), we will get (3): [1 a1_0 11.0 1001] This will be combineAndGetUpdateValue with (1) to produce: > (1) + (3) [1 a1_0 11.0 1001] ``` # combineAndGetUpdateValue only ``` Table initial state (1): [1 a1_0 10.0 1000] Table performs an update (2): (combineAndGetUpdateValue) [1 a1_0 11.0 999] to produce (3) [NO CHANGE]: [1 a1_0 10.0 1000] Table performs an update again (3): (combineAndGetUpdateValue) [1 a1_0 null 1001] End state of the table: > (2) + (3): [1 a1_0 10.0 1001] ```` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org