[ 
https://issues.apache.org/jira/browse/HUDI-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693534#comment-17693534
 ] 

kazdy commented on HUDI-5272:
-----------------------------

Starting from 0.13, the precombine field is optional in Spark.
Previously this was only available in Flink. In Flink, however, 
COMBINE_BEFORE_UPSERT is set to false by default, so if no precombine field is 
provided, upserts can be done without any configuration changes.

In Hudi + Spark, on the other hand, users must first explicitly set the 
COMBINE_BEFORE_UPSERT option to false in order to do upserts in the absence 
of a precombine field.
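
For reference, the explicit setting looks roughly like this as PySpark write options. This is only a sketch: the table name and record key column are placeholders I made up, and the write call itself is left as a comment because it needs a Spark session with the Hudi bundle on the classpath.

```python
# Hudi write options for an upsert without a precombine field (Spark, 0.13+).
# "example_table" and "id" are placeholders, not taken from the issue.
hudi_options = {
    "hoodie.table.name": "example_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    # No "hoodie.datasource.write.precombine.field" is set, so combining
    # before upsert has to be switched off explicitly.
    "hoodie.combine.before.upsert": "false",
}

# With a DataFrame `df` and a target `base_path`, the options would be used as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```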

As a Hudi user, if no precombine field is provided, I would like Hudi to 
automatically set COMBINE_BEFORE_UPSERT to the appropriate value, to provide a 
seamless experience.

I assume the precombine field can be optional only if the table type is CoW. 
For MoR, precombine is required for the table to work properly, so it is OK to 
throw an error when the operation is upsert and no precombine field is set.
Therefore this should work only for CoW.
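
To make the proposal concrete, here is a rough sketch of the defaulting logic, written over a plain options dict. `apply_no_precombine_defaults` is a hypothetical helper used for illustration only; it is not an existing Hudi function and says nothing about where the real change would live. It assumes the Spark datasource defaults of `upsert` for the operation and `COPY_ON_WRITE` for the table type.

```python
PRECOMBINE_KEY = "hoodie.datasource.write.precombine.field"
COMBINE_BEFORE_UPSERT_KEY = "hoodie.combine.before.upsert"
TABLE_TYPE_KEY = "hoodie.datasource.write.table.type"
OPERATION_KEY = "hoodie.datasource.write.operation"


def apply_no_precombine_defaults(options):
    """Hypothetical helper: return a copy of options with the proposed defaults."""
    resolved = dict(options)
    is_upsert = resolved.get(OPERATION_KEY, "upsert") == "upsert"
    # Nothing to do when a precombine field is given or the write is not an upsert.
    if resolved.get(PRECOMBINE_KEY) or not is_upsert:
        return resolved
    table_type = resolved.get(TABLE_TYPE_KEY, "COPY_ON_WRITE")
    if table_type == "MERGE_ON_READ":
        # MoR needs a precombine field, so fail instead of guessing.
        raise ValueError("precombine field is required for upserts into MoR tables")
    # CoW: disable combining, unless the user already set the option explicitly.
    resolved.setdefault(COMBINE_BEFORE_UPSERT_KEY, "false")
    return resolved
```

With this, a CoW upsert without a precombine field would pick up `hoodie.combine.before.upsert=false` automatically, while the same write against a MoR table would fail early with a clear error.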

> Align with Flink to support no_precombine in spark
> --------------------------------------------------
>
>                 Key: HUDI-5272
>                 URL: https://issues.apache.org/jira/browse/HUDI-5272
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: kazdy
>            Assignee: kazdy
>            Priority: Major
>
> Flink supports {{public static final String NO_PRE_COMBINE = 
> "no_precombine";}} (although it is not documented) for inserts and updates.
> This was introduced by [#3874|https://github.com/apache/hudi/pull/3874].
> https://issues.apache.org/jira/browse/HUDI-2633
> {{When the precombine field is not specified, we use the proctime semantics, 
> that means, the records come later are more fresh}}
> There is an argument against it, because for updates records cannot be 
> deduplicated properly. But at the same time Hudi allows us to use non-strict 
> insert mode, which breaks PK uniqueness.
> Users can make an informed decision and handle duplicates on their own, or 
> bring in their own precombine logic (window functions etc.) before triggering 
> a Hudi write.
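
As an illustration of the "bring your own precombine logic" option mentioned in the description, below is a plain-Python sketch of the deduplication a window function would perform (in PySpark, roughly `row_number()` over a window partitioned by the key and ordered by the ordering column descending, keeping row 1). The function and field names are made up for the example, and the records are ordinary dicts rather than a DataFrame.

```python
def keep_latest_per_key(records, key_field, order_field):
    """Keep one record per key: the one with the largest order_field value."""
    latest = {}
    for record in records:
        key = record[key_field]
        current = latest.get(key)
        # On ties the record that comes later in the batch wins.
        if current is None or record[order_field] >= current[order_field]:
            latest[key] = record
    return list(latest.values())


batch = [
    {"id": 1, "ts": 1, "value": "a"},
    {"id": 2, "ts": 5, "value": "b"},
    {"id": 1, "ts": 3, "value": "c"},
]
deduplicated = keep_latest_per_key(batch, key_field="id", order_field="ts")
```

After this step each key appears once in the batch, so the write no longer depends on Hudi combining records before the upsert.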



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
