Hi Vinoth,

Thank you for your guidance.

I went through the code for RepairsCommand in the hudi-cli package, which
internally calls DedupeSparkJob.scala. The logic there marks a file as bad
based on the commit time of its records. However, in my case even the
commit time is the same for the duplicates; the only fields that differ are
`_hoodie_commit_seqno` and `_hoodie_file_name`. So I guess this class will
not help me.

IIUC the logic in DedupeSparkJob only works when the duplicates were created
by an INSERT operation. If an UPDATE later comes in for a duplicated record,
both files containing that record end up with the same commit time from then
on. Such cases cannot be handled by looking at `_hoodie_commit_time`, which
is exactly what I am seeing.
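
For what it is worth, this is roughly how I am confirming that in
spark-shell (the table path below is just a placeholder for the affected
partition):

  import org.apache.spark.sql.functions._

  // record keys that share a commit time but live in more than one file
  val df = spark.read.parquet("s3://bucket/hudi_table/2020/04/13")
  df.groupBy("_hoodie_record_key", "_hoodie_commit_time")
    .agg(countDistinct("_hoodie_file_name").as("num_files"))
    .filter(col("num_files") > 1)
    .show(false)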

I have written a script to solve my use case. It is a no-brainer: I simply
delete the duplicate keys and rewrite the file. I wanted to check whether it
would add any value to our code base and whether I should raise a PR for it.
If the community agrees, we can work together to improve it further and make
it generic enough.
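
At a high level the script boils down to something like the following
(a simplified spark-shell sketch with placeholder paths, not the exact
script):

  import org.apache.spark.sql.SaveMode

  val partitionPath = "s3://bucket/hudi_table/2020/04/13"

  // keep exactly one copy of every record key in the partition
  val deduped = spark.read.parquet(partitionPath)
    .dropDuplicates("_hoodie_record_key")

  // write to a temporary location first, then swap it in for the
  // original parquet files after verifying the counts
  deduped.write.mode(SaveMode.Overwrite).parquet(partitionPath + "_dedup_tmp")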

On Mon, Apr 13, 2020 at 8:22 PM Vinoth Chandar <vin...@apache.org> wrote:

> Hi Pratyaksh,
>
> Your understanding is correct. There is a duplicate-fix tool in the CLI (I
> wrote it a while ago for COW, and did use it in production a few times for
> situations like these). Check that out? IIRC it will keep both the commits
> and their files, but simply get rid of the duplicate records and replace
> the parquet files in place.
>
> >> Also once duplicates are written, you
> are not sure of which file the update will go to next since the record is
> already present in 2 different parquet files.
>
> IIRC bloom index will tag both files and both will be updated.
>
> The table could show many side effects depending on when exactly the race
> happened.
>
> - the second commit may have rolled back the first inflight commit,
> mistaking it for a failed write. In that case some data may also be
> missing, though I expect the first commit to actually fail since files got
> deleted midway into writing.
> - if both of them indeed succeeded, then it's just the duplicates
>
>
> Thanks
> Vinoth
>
>
>
>
>
> On Mon, Apr 13, 2020 at 6:12 AM Pratyaksh Sharma <pratyaks...@gmail.com>
> wrote:
>
> > Hi,
> >
> > From my experience so far of working with Hudi, I understand that Hudi is
> > not designed to handle concurrent writes from 2 different sources, for
> > example 2 instances of HoodieDeltaStreamer simultaneously running and
> > writing to the same dataset. I have seen that such a case can result in
> > duplicate records in the case of inserts. Also, once duplicates are
> > written, you are not sure which file a subsequent update will go to,
> > since the record is already present in 2 different parquet files. Please
> > correct me if I am wrong.
> >
> > Having experienced this in a few Hudi datasets, I now want to delete one
> > of the parquet files containing duplicates in some partition of a
> > COW-type Hudi dataset. Can deleting a parquet file manually have any
> > repercussions? If yes, what are the possible side effects of doing so?
> >
> > Any leads will be highly appreciated.
> >
>
