Hi Zhengxiang, precombine works like this. If there are several rows with the same row_key in an insert/update batch of records, the precombine key will be used to pick the latest value of the same row_key. Taking a really simple example, assume these are the 6 records in the original dataset. |row_key|precombine_key|other columns|...| |abc|1|...|...| |abc|2|...|...| |def|3|...|...| |abc|4|...|...| |def|5|...|...| |ghi|6|...|...|
On applying the precombine the hudi dataset becomes: |abc|4|...|...| |def|5|...|...| |ghi|6|...|...| In this case you will not see all 6 records. It will be reduced to 1 per distinct row_key after applying the precombine logic. I think this is what is happening in your case. I noticed that the precombine key is a string from the snippet. String.compareTo would be used to determine the latest value of strings. Please note that in the above example, I assumed default values for the configs "PAYLOAD_CLASS_OPT_KEY <https://hudi.apache.org/configurations.html#PAYLOAD_CLASS_OPT_KEY>", " PRECOMBINE_FIELD_OPT_KEY <https://hudi.apache.org/configurations.html#PRECOMBINE_FIELD_OPT_KEY>", etc. You can change these configs based on your needs. Can you please verify if this is the case? Thanks, Sudha On Wed, Nov 13, 2019 at 2:11 PM Zhengxiang Pan <[email protected]> wrote: > Hi Sudha, > Yes, I did check, the number of distinct row_key matches. My understanding > is that row_key is not the key to do de-dup. My row_key is not unique, > meaning several rows might have the same row_key, but pre-combine key for > sure is unique. > > Thanks, > Pan > > On Wed, Nov 13, 2019 at 2:54 PM Bhavani Sudha <[email protected]> > wrote: > > > Hi Zhengxiang, > > > > regarding issue 2, were you able to confirm if the number of distinct > > row_key in your original df and the distinct row_key in Hudi dataset > > matches? If that matches, then we can dig into the precombine logic to > see > > whats happening. > > > > Thanks, > > Sudha > > > > On Tue, Nov 12, 2019 at 9:42 AM Zhengxiang Pan <[email protected]> > wrote: > > > > > Hi Balaji.V, > > > W.r.t issue 1), same issue occurs with spark 2.3.4. > > > > > > Pan > > > > > >
