Hi,

What you mentioned is correct.

@Override
public Optional<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue,
    Schema schema) throws IOException {
  // combining strategy here trivially ignores currentValue on disk and
  // writes this record
  return getInsertValue(schema);
}

I think we could change this behavior to match pre-combining. Are you
interested in sending a patch?
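
Roughly, the payload could compare the ordering (precombine) value of the
record on storage against the incoming one. An untested sketch, assuming the
payload keeps its ordering value in an orderingVal field, with a hard-coded
"ts" as a stand-in for the configured precombine field:

@Override
public Optional<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue,
    Schema schema) throws IOException {
  // Sketch only: "ts" stands in for whatever precombine field is
  // configured; a real patch would need to plumb that field name through.
  Comparable currentOrderingVal =
      (Comparable) currentValue.get(schema.getField("ts").pos());
  if (orderingVal.compareTo(currentOrderingVal) < 0) {
    // the record already on storage is newer, keep it
    return Optional.of(currentValue);
  }
  return getInsertValue(schema);
}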

Thanks
Vinoth

On Fri, May 17, 2019 at 7:18 AM Vinoth Chandar <vin...@apache.org> wrote:

> Thanks for the clear example. Let me check this out and get back shortly.
>
> On Thu, May 16, 2019 at 5:29 PM Yanjia Li <yanjia.gary...@gmail.com>
> wrote:
>
>> Hello Vinoth,
>>
>> I could add an example here to clarify this question.
>>
>> We have DF1{id:1, ts:9} and DF2{id:1, ts:1; id:1, ts:2}. We save DF1
>> first, then upsert DF2 onto DF1. With the default payload, the final
>> result is DF{id:1, ts:2}, but we are looking for DF{id:1, ts:9}. If I
>> understand correctly, precombine only combines the data within the delta
>> dataframe, which is DF2 in this example, so the default payload only
>> guarantees that we keep the latest timestamp within the current batch. In
>> this example, the newer data arrived before the older data. We would like
>> to confirm whether we need to write our own payload to handle this case.
>> It would also be helpful to know if anyone else has had a similar issue
>> before.
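>>
>> In Spark terms, the sequence is roughly the following (an illustrative
>> Java sketch, not our exact code; df1, df2, and basePath are placeholders,
>> and the format name and option keys may vary by Hudi version):
>>
>> // first write: DF1 = {id:1, ts:9}
>> df1.write().format("org.apache.hudi")
>>     .option("hoodie.datasource.write.recordkey.field", "id")
>>     .option("hoodie.datasource.write.precombine.field", "ts")
>>     .option("hoodie.table.name", "example_table")
>>     .mode(SaveMode.Overwrite)
>>     .save(basePath);
>>
>> // upsert: DF2 = {id:1, ts:1} and {id:1, ts:2}. Precombine keeps
>> // {id:1, ts:2} within DF2, which then overwrites {id:1, ts:9} on storage.
>> df2.write().format("org.apache.hudi")
>>     .option("hoodie.datasource.write.recordkey.field", "id")
>>     .option("hoodie.datasource.write.precombine.field", "ts")
>>     .option("hoodie.table.name", "example_table")
>>     .mode(SaveMode.Append)
>>     .save(basePath);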
>>
>> Thanks so much!
>> Gary
>>
>> On Thu, May 16, 2019 at 2:49 PM Vinoth Chandar <vin...@apache.org> wrote:
>>
>> > Hi,
>> >
>> > (Please subscribe to the mailing list, so that your messages come
>> > through directly to the list.)
>> >
>> > On 1, the default payload overwrites the record on storage with the
>> > incoming record if the precombine field has a higher value. For example,
>> > if you use a timestamp field, it will overwrite with the latest record,
>> > but it will not overwrite if you accidentally write a much older record.
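>> >
>> > For reference, the payload's preCombine is roughly the following
>> > (paraphrased from memory, not the exact source):
>> >
>> > @Override
>> > public OverwriteWithLatestAvroPayload preCombine(OverwriteWithLatestAvroPayload another) {
>> >   // keep whichever record has the greater ordering (precombine) value
>> >   if (another.orderingVal.compareTo(orderingVal) > 0) {
>> >     return another;
>> >   }
>> >   return this;
>> > }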
>> >
>> > On 2, I think you can achieve this by setting the precombine key
>> > properly. IIUC, you don't want an older record to overwrite the newer
>> > record?
>> >
>> > On 3, you can configure the PRECOMBINE key as documented here:
>> > http://hudi.apache.org/configurations.html#PRECOMBINE_FIELD_OPT_KEY
>> >
>> > Hope that helps. Please let me know if I missed something.
>> >
>> >
>> > Thanks
>> > Vinoth
>> >
>> > On Thu, May 16, 2019 at 7:07 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <
>> > fixed-term.yuanbin.ch...@us.bosch.com> wrote:
>> >
>> > > Hi Hudi,
>> > >
>> > > We want to use Apache Hudi to migrate our data pipeline from batch
>> > > processing to incremental processing.
>> > > We have several questions about Hudi, and we would appreciate your
>> > > help figuring them out.
>> > >
>> > >
>> > > 1.      In the default payload (OverwriteWithLatestAvroPayload), the
>> > > payload only considers and merges records with the same key within the
>> > > delta dataframe (the newly incoming records), right?
>> > >
>> > > 2.      In our use case, we want to keep the latest record in our
>> > > system. However, in the default payload, if the delta dataframe
>> > > contains a record older than the record already written to Hudi, it
>> > > will simply overwrite it, which is not what we want. Do you have any
>> > > suggestions on how to get the globally latest record in Hudi?
>> > >
>> > > 3.      We have implemented a custom payload class in order to get the
>> > > globally latest record. However, we found that in the payload class we
>> > > have to hard-code the PRECOMBINE_FIELD_OPT_KEY value in order to read
>> > > the corresponding value from currentValue and compare them. Is there
>> > > any way to get PRECOMBINE_FIELD_OPT_KEY inside the payload, or is
>> > > there a suggested method for dealing with this issue?
>> > >
>> > > Thanks so much!
>> > >
>> > > Mit freundlichen Grüßen / Best regards
>> > >
>> > > Yuanbin Cheng
>> > >
>> > >
>> > >
>> >
>>
>
