Hi,

From my experience working with Hudi so far, I understand that Hudi is
not designed to handle concurrent writes to the same dataset from two
different sources, for example two instances of HoodieDeltaStreamer
running and writing simultaneously. I have seen such a setup produce
duplicate records on insert. Worse, once duplicates are written, you
cannot be sure which file a subsequent update will go to, since the
record key is now present in two different parquet files. Please
correct me if I am wrong.
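
For context, here is roughly how I have been spotting the duplicates,
using Hudi's built-in metadata columns. This is a minimal sketch in
PySpark; the base path is a placeholder, and older Hudi versions may
need a partition glob instead of loading the base path directly:

    # Minimal sketch: find record keys backed by more than one parquet
    # file in a COW Hudi dataset, via Hudi's metadata columns.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("hudi-dup-check").getOrCreate()

    basePath = "s3://my-bucket/my-hudi-dataset"  # hypothetical path
    df = spark.read.format("org.apache.hudi").load(basePath)

    # A record key tied to more than one distinct file name means the
    # insert was duplicated across parquet files.
    dups = (df.groupBy("_hoodie_record_key", "_hoodie_partition_path")
              .agg(F.countDistinct("_hoodie_file_name").alias("num_files"))
              .filter(F.col("num_files") > 1))

    dups.show(truncate=False)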

Having run into this in a few Hudi datasets, I now want to delete one
of the parquet files containing duplicates from a partition of a COW
type Hudi dataset. I want to know whether manually deleting a parquet
file can have any repercussions, and if so, what the side effects
might be.
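
Before touching anything, my plan was to at least confirm that every
record key in the file I want to remove also exists in some other
file, so deleting it would not drop unique records. A minimal sketch
of that check, with a hypothetical file name and placeholder path:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("hudi-delete-check").getOrCreate()
    df = spark.read.format("org.apache.hudi").load("s3://my-bucket/my-hudi-dataset")

    # Hypothetical name of the duplicate-bearing file I intend to delete.
    target_file = "abc123-0_0-21-25_20200101000000.parquet"

    in_target = (df.filter(F.col("_hoodie_file_name") == target_file)
                   .select("_hoodie_record_key"))
    elsewhere = (df.filter(F.col("_hoodie_file_name") != target_file)
                   .select("_hoodie_record_key")
                   .distinct())

    # Keys present ONLY in the file to be deleted would be lost for good.
    only_in_target = in_target.subtract(elsewhere)
    print(only_in_target.count())  # expect 0 if the file holds nothing unique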

Any leads would be highly appreciated.
