Re: Broadcast join data reuse

2020-06-15 Thread gypsysunny
The broadcast table does not seem to be reused across multiple actions, e.g.:

    val small_df_bc = broadcast(small_df)
    big_df1.join(small_df_bc, Seq("id")).write.parquet("/test1")
    big_df2.join(small_df_bc, Seq("id")).write.parquet("/test2")

We can tell the small df has been distributed twice in the

Re: Broadcast join data reuse

2020-06-11 Thread Ankur Srivastava
Hi Tyson,

The broadcast variable should remain in memory on the executors and be reused unless you unpersist it, destroy it, or it goes out of scope.

Hope this helps.

Thanks,
Ankur

On Wed, Jun 10, 2020 at 5:28 PM wrote:
> We have a case where the data is small enough to be broadcast and joined
>

Broadcast join data reuse

2020-06-10 Thread tcondie
We have a case where the data is small enough to be broadcast and joined with multiple tables in a single plan. Looking at the physical plan, I do not see anything that indicates whether the broadcast is done only once, i.e., that the BroadcastExchange is being reused, i.e., that data is not