Re: Caching broadcasted DataFrames?

2016-08-25 Thread Takeshi Yamamuro
Hi, you need to cache df1 to prevent re-computation (including disk reads), because Spark re-broadcasts the data on every SQL execution. // maropu
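A minimal sketch of the suggested fix, assuming `d1` is the small DataFrame from the question below and `d2`, `d3`, the parquet paths, and the `"id"` join column are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.appName("cache-then-broadcast").getOrCreate()

val d1 = spark.read.parquet("/path/to/small")   // hypothetical path
val d2 = spark.read.parquet("/path/to/large1")  // hypothetical path
val d3 = spark.read.parquet("/path/to/large2")  // hypothetical path

// Cache d1 first so each broadcast rebuild reads from executor memory
// instead of re-scanning the source files.
d1.cache()
d1.count()  // action to materialize the cache

// Each join still broadcasts d1 separately, but the broadcast data
// is now built from the cached partitions, not from disk.
val joined1 = d2.join(broadcast(d1), "id")
val joined2 = d3.join(broadcast(d1), "id")
```

Without the `cache()` call, both joins would trigger a full re-read of `d1`'s source when constructing their broadcast variables.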

Caching broadcasted DataFrames?

2016-08-25 Thread Jestin Ma
I have a DataFrame d1 that I would like to join with two separate DataFrames. Since d1 is small enough, I broadcast it. What I understand about cache vs broadcast is that cache leads to each executor storing, in memory, only the partitions it is assigned (cluster-wide in-memory). Broadcast leads to each executor storing a full copy of d1.
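The cache-vs-broadcast distinction described above can be sketched as follows (the `d1`, `big`, and `"id"` names are hypothetical placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.getOrCreate()
val d1  = spark.range(100).toDF("id")        // hypothetical small DataFrame
val big = spark.range(1000000).toDF("id")    // hypothetical large DataFrame

// cache(): each executor keeps in memory only the d1 partitions
// assigned to it, distributed across the cluster.
d1.cache()

// broadcast(): every executor receives a full copy of d1, enabling a
// shuffle-free map-side (broadcast hash) join.
val joined = big.join(broadcast(d1), "id")
```

The two are complementary rather than exclusive: caching keeps d1's data resident in memory, while broadcasting controls how the join itself is executed.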