Hi Himabindu,

I am assuming the 700 GB is your total data on storage and not the incoming
batch.
INSERT_DROP_DUPS does work with large data.
However, it is more time-consuming because it needs to tag the incoming
records against the existing index to dedupe them.
I would suggest creating a GitHub issue with Spark UI screenshots and your
datasource write configs.
It would also help if you could describe your use case for
INSERT_DROP_DUPS; maybe there is a better alternative.
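
For reference (just a sketch, with placeholder table name, key/partition/
precombine fields, and GCS path, so adjust to your actual setup), the kind
of datasource write configs I mean would look something like this in Scala:

    df.write.format("hudi")
      // COW table, plain insert with duplicate dropping enabled
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
      .option("hoodie.datasource.write.operation", "insert")
      .option("hoodie.datasource.write.insert.drop.duplicates", "true")
      // record key / partition path / precombine field are placeholders
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .mode("append")
      .save("gs://your-bucket/path/to/table")

Sharing the equivalent of the above from your job (plus any index or
parallelism overrides) in the GitHub issue will make it much easier to see
what is going wrong.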

Regards,
Sagar


On Thu, Oct 12, 2023 at 3:42 AM Himabindu Kosuru <hkos...@yahoo.com.invalid>
wrote:

> Hi All,
> We are using COW tables, and INSERT_DROP_DUPS fails with
> HoodieUpsertException even on 700 GB of data. The data is partitioned and
> stored in GCS.
> Executors: 150, executor memory: 40g, executor cores: 8
>
> Does INSERT_DROP_DUPS work with large data? Any recommendations to make it
> work such as spark config settings?
>
> Thanks,
> Bindu
