Hi,

I think the problem lies with driver memory. Broadcast in Spark works by
collecting all the data to the driver, which then broadcasts it to all the
executors. A different strategy, like BitTorrent, could be employed for the
transfer, though.

Please try increasing the driver memory and see if it works.
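
As a rough sketch (the app name and the 8g figure below are placeholders to
size for your cluster): driver memory has to be in place before the driver
JVM starts, so with spark-submit it goes on the command line, e.g.
spark-submit --driver-memory 8g your_job.py. If you build the session
yourself from a plain python launch, it can go in the builder config:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('broadcast-join')            # hypothetical app name
         .config('spark.driver.memory', '8g')  # example value; tune to fit
         .getOrCreate())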

Regards,
Amit


On Thursday, September 17, 2020, Rishi Shah <rishishah.s...@gmail.com>
wrote:

> Hello All,
>
> Hope this email finds you well. I have a dataframe of size 8TB (parquet,
> snappy compressed). When I group it by a column, I get a much smaller
> aggregated dataframe of only 700 rows (just two columns, key and count).
> However, when I use this aggregated result in a broadcast join as below,
> it throws a "dataframe cannot be broadcasted" error.
>
> from pyspark.sql.functions import broadcast
>
> df_agg = df.groupBy('column1').count().cache()
> # df_agg.count()  # optional: materializes the cache
> df_join = df.join(broadcast(df_agg), 'column1', 'left_outer')
> df_join.write.parquet('PATH')
>
> The same code works without any modifications on an input df of size 3TB.
>
> Any suggestions?
>
> --
> Regards,
>
> Rishi Shah
>
