Hi,
you need to cache d1 to prevent re-computation (including disk reads),
because Spark re-broadcasts the data on every SQL execution.
// maropu
On Fri, Aug 26, 2016 at 2:07 AM, Jestin Ma wrote:
I have a DataFrame d1 that I would like to join with two separate
DataFrames.
Since d1 is small enough, I broadcast it.
What I understand about cache vs. broadcast is that cache leads to each
executor storing, in memory, the partitions it is assigned (cluster-wide
in-memory storage). Broadcast leads to each executor storing a full copy
of the entire DataFrame in memory.