Hi Himabindu, I am assuming your total data on storage is 700GB and not the incoming batch. INSERT_DROP_DUPS does work with large data. However, it is more time-consuming because the incoming records have to be tagged against the existing data to identify duplicates. I would suggest creating a GitHub issue with Spark UI screenshots and your datasource write configs. It would also help if you could describe your use case for INSERT_DROP_DUPS; there may be a better alternative.
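For reference, here is a minimal sketch of the kind of write I am picturing (assuming INSERT_DROP_DUPS here means the hoodie.datasource.write.insert.drop.duplicates flag on a Spark datasource write; the table name, path, and key/partition fields below are placeholders, not your actual configs):

import org.apache.spark.sql.SaveMode

// Insert with insert-time dedupe enabled; placeholder fields marked below.
inputDf.write.format("hudi").
  option("hoodie.datasource.write.operation", "insert").
  option("hoodie.datasource.write.insert.drop.duplicates", "true").
  option("hoodie.datasource.write.recordkey.field", "record_key").      // placeholder key field
  option("hoodie.datasource.write.precombine.field", "ts").             // placeholder precombine field
  option("hoodie.datasource.write.partitionpath.field", "dt").          // placeholder partition field
  option("hoodie.table.name", "my_table").                              // placeholder table name
  mode(SaveMode.Append).
  save("gs://my-bucket/hudi/my_table")                                  // placeholder GCS path

Whether an alternative (for example a plain upsert if you only need one record per key) would actually be faster depends on the workload, which is why the use case details would help.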
Regards,
Sagar

On Thu, Oct 12, 2023 at 3:42 AM Himabindu Kosuru <hkos...@yahoo.com.invalid> wrote:
> Hi All,
> We are using COW tables and INSERT_DROP_DUPS fails with
> HoodieUpsertException even on 700 GB of data. The data is partitioned and
> stored in GCS.
> Executors: 150, Exec memory: 40g, Exec cores: 8
>
> Does INSERT_DROP_DUPS work with large data? Any recommendations to make it
> work, such as Spark config settings?
>
> Thanks,
> Bindu