Hi Anil,
Superb. When I said increase the number of partitions, I meant the shuffle
partitions, because you are doing deduplication. By default that is 200,
which can create issues if your data volume is large.
I always prefer SPARK SQL to SPARK dataframes.
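For reference, here is a minimal sketch of what I mean; the view name
"events" and the partition count are purely illustrative:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("dedup-sketch")
    // Raise shuffle partitions above the default of 200 so each task
    // processes fewer records; 2000 is purely illustrative.
    .config("spark.sql.shuffle.partitions", "2000")
    .getOrCreate()

  // Deduplicate through SPARK SQL instead of the DataFrame API.
  // "events" is a hypothetical temp view registered elsewhere.
  val deduped = spark.sql("SELECT DISTINCT * FROM events")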
I am not sure how to set the records limit. Let me check. I couldn't find
a Parquet row group size configuration in Spark.
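The closest thing I came across is Parquet's own Hadoop writer setting,
parquet.block.size, which Spark should pass through via its Hadoop
configuration; I haven't verified this on our version, so treat the sketch
below as untested, with "df" and the output path as placeholders:

  // Untested sketch: parquet.block.size is the Parquet-Hadoop row group
  // size in bytes; 134217728 (128 MB) mirrors the Parquet default.
  spark.sparkContext.hadoopConfiguration
    .setInt("parquet.block.size", 134217728)

  // "df" and the output path are hypothetical placeholders.
  df.write.parquet("/tmp/output")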
For now, I increased the number of shuffle partitions to reduce the records
processed per task and avoid OOM.
Regards,
Anil
From: Gourav Sengupta
Date: Saturday, March 5,
Hi,
I completely agree with Saurabh: the use of BQ with SPARK does not make
sense at all if you are trying to cut down your costs. I think that costs
do matter to a few people in the end.
Saurabh, is there any chance you can see what actual queries are hitting
the thrift server? Using Hive
Hi Anil,
Any chance you tried setting a limit on the number of records to be
written out at a time?
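Something along these lines is what I had in mind; maxRecordsPerFile is a
DataFrameWriter option (there is also the spark.sql.files.maxRecordsPerFile
conf), and the record count and path below are only illustrative, assuming
an existing DataFrame df:

  // Cap the number of records written to each output file.
  // 1000000 is an arbitrary illustrative limit; "df" is assumed to exist.
  df.write
    .option("maxRecordsPerFile", 1000000)
    .parquet("/tmp/output")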
Regards,
Gourav
On Thu, Mar 3, 2022 at 3:12 PM Anil Dasari wrote:
> Hi Gourav,
>
> Tried increasing the number of shuffle partitions and using higher
> executor memory. Both didn't work.
>
>
>
> Regards
>