Re: {EXT} Re: Spark Parquet write OOM

2022-03-05 Thread Gourav Sengupta
Hi Anil, superb. When I said increase the number of partitions, I meant shuffle partitions, because you are doing deduplication. By default that should be around 200, which can create issues if your data volume is large. I always prefer SPARK SQL to SPARK dataframes.
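
(For reference, a minimal sketch of what raising the shuffle partition count for a Spark SQL deduplication might look like; the setting value, paths, and column names below are hypothetical illustrations, not taken from the thread.)

    import org.apache.spark.sql.SparkSession

    object DedupWithMorePartitions {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("dedup-example")
          // Raise the shuffle partition count above the default of 200 so each
          // task handles fewer records during the deduplication shuffle.
          .config("spark.sql.shuffle.partitions", "2000")
          .getOrCreate()

        // Hypothetical source path; substitute the real input.
        spark.read.parquet("s3://bucket/input/").createOrReplaceTempView("events")

        // Deduplication expressed in Spark SQL rather than the DataFrame API,
        // as suggested in the thread. "id" and "updated_at" are placeholder columns.
        val deduped = spark.sql(
          """SELECT * FROM (
            |  SELECT *, row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
            |  FROM events
            |) WHERE rn = 1""".stripMargin)

        deduped.drop("rn").write.mode("overwrite").parquet("s3://bucket/output/")
      }
    }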

Re: {EXT} Re: Spark Parquet write OOM

2022-03-05 Thread Anil Dasari
I am not sure how to set the records limit. Let me check. I couldn’t find a parquet row group size configuration in Spark. For now, I increased the number of shuffle partitions to reduce the records processed per task and avoid OOM. Regards, Anil From: Gourav Sengupta Date: Saturday, March 5,
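
(On the row group size point: the underlying parquet-mr writer takes its row group size, in bytes, from the Hadoop property parquet.block.size (128 MB by default). A hedged sketch of setting it from Spark's Hadoop configuration, assuming that property is honoured by the Parquet write path; the paths and values are hypothetical.)

    import org.apache.spark.sql.SparkSession

    object SmallerRowGroups {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("row-group-size").getOrCreate()

        // parquet-mr reads the row group size (bytes) from "parquet.block.size";
        // 32 MB here instead of the 128 MB default, so each open row group
        // buffers less data in the writer's memory.
        spark.sparkContext.hadoopConfiguration
          .setInt("parquet.block.size", 32 * 1024 * 1024)

        // Also raise the shuffle partition count, as done in the thread, so each
        // task processes and writes fewer records.
        spark.conf.set("spark.sql.shuffle.partitions", "2000")

        // Hypothetical paths and key column.
        val df = spark.read.parquet("s3://bucket/input/")
        df.dropDuplicates("id").write.mode("overwrite").parquet("s3://bucket/output/")
      }
    }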

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-03-05 Thread Gourav Sengupta
Hi, I completely agree with Saurabh: the use of BQ with SPARK does not make sense at all if you are trying to cut down your costs. I think costs do matter to a few people in the end. Saurabh, is there any chance you can see what actual queries are hitting the thrift server? Using hive

Re: {EXT} Re: Spark Parquet write OOM

2022-03-05 Thread Gourav Sengupta
Hi Anil, any chance you tried setting a limit on the number of records to be written out at a time? Regards, Gourav On Thu, Mar 3, 2022 at 3:12 PM Anil Dasari wrote: > Hi Gourav, > > Tried increasing the number of shuffle partitions and higher executor memory. > Both didn’t work. > > > > Regards >
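
(One way to set such a limit is the DataFrameWriter option maxRecordsPerFile, or the equivalent session config spark.sql.files.maxRecordsPerFile. A minimal sketch with hypothetical paths and values follows; note this caps the records per output Parquet file rather than the writer's memory directly.)

    import org.apache.spark.sql.SparkSession

    object CapRecordsPerFile {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("max-records-per-file").getOrCreate()

        // Hypothetical input path.
        val df = spark.read.parquet("s3://bucket/input/")

        // "maxRecordsPerFile" makes each task roll over to a new Parquet file
        // once the record count is reached (session-wide equivalent:
        // spark.sql.files.maxRecordsPerFile).
        df.write
          .option("maxRecordsPerFile", "1000000")
          .mode("overwrite")
          .parquet("s3://bucket/output/")
      }
    }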