Thank you Sagar! Here is the issue - https://github.com/apache/hudi/issues/9859
On Friday, October 13, 2023 at 01:52:24 AM EDT, sagar sumit
<[email protected]> wrote:
Hi Himabindu,
I am assuming your total data on storage is 700 GB and not the incoming batch.
INSERT_DROP_DUPS does work with large data. However, it is more time-consuming
as it needs to tag the incoming records to dedupe.
I would suggest creating a GitHub issue with Spark UI screenshots and your
datasource write configs. Also, it would be helpful if you could provide your
use case for INSERT_DROP_DUPS. Maybe there is a better alternative.
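For reference, the write path being discussed would look roughly like the sketch
below. The table name, record key/precombine/partition fields, and the GCS path
are placeholders, not taken from this thread; only the option keys are actual
Hudi configs.

  import org.apache.spark.sql.{DataFrame, SaveMode}

  // Sketch of an insert-with-dedup write to a COW table. All names, fields,
  // and the GCS path are placeholders.
  def writeInsertDropDups(inputDf: DataFrame): Unit = {
    inputDf.write
      .format("hudi")
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
      .option("hoodie.datasource.write.operation", "insert")
      // INSERT_DROP_DUPS: drop incoming records whose keys already exist in the table
      .option("hoodie.datasource.write.insert.drop.duplicates", "true")
      .option("hoodie.datasource.write.recordkey.field", "record_id")
      .option("hoodie.datasource.write.precombine.field", "updated_at")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .mode(SaveMode.Append)
      .save("gs://my-bucket/hudi/my_table")
  }

Tagging each incoming key against the existing table is what makes this slower
than a plain insert, especially as the table grows.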
Regards,
Sagar
On Thu, Oct 12, 2023 at 3:42 AM Himabindu Kosuru <[email protected]>
wrote:
Hi All,
We are using COW tables, and INSERT_DROP_DUPS fails with HoodieUpsertException
even on 700 GB of data. The data is partitioned and stored in GCS.
Executors: 150
Executor memory: 40g
Executor cores: 8
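For reference, that sizing corresponds roughly to the Spark settings below
(normally supplied via spark-submit; the app name is just a placeholder):

  import org.apache.spark.sql.SparkSession

  // Cluster sizing as stated above, expressed as Spark configs.
  val spark = SparkSession.builder()
    .appName("hudi-insert-drop-dups")          // placeholder app name
    .config("spark.executor.instances", "150")
    .config("spark.executor.memory", "40g")
    .config("spark.executor.cores", "8")
    .getOrCreate()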
Does INSERT_DROP_DUPS work with large data? Any recommendations to make it work,
such as Spark config settings?
Thanks,
Bindu