Hi

I am facing a huge performance problem when trying to left-outer-join a very
large data set (~140 GB) with a bunch of small lookup tables (star-schema
style). I am using DataFrames in Spark SQL. It looks like the data gets
shuffled and skewed when that join happens. Is there any way to improve the
performance of this type of join in Spark?

How can I hint the optimizer to use a replicated (map-side) join and avoid the
shuffle? Would it help to create broadcast variables for the small lookups? If
I create broadcast variables, how can I convert them into DataFrames and use
them in a Spark SQL style join?
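For context, this is roughly what I have in mind (just a sketch, assuming Spark 1.5+, where `broadcast()` from `org.apache.spark.sql.functions` hints the planner; the table and column names here are made up):

```scala
import org.apache.spark.sql.functions.broadcast

// Hypothetical names: "facts" is the ~140 GB table, "dim" is one small lookup.
val facts = sqlContext.table("facts")
val dim   = sqlContext.table("dim")

// broadcast() hints the planner to replicate the small side to every
// executor, turning the shuffle join into a map-side (broadcast) join.
val joined = facts.join(broadcast(dim), Seq("dim_key"), "left_outer")
```

Is this the right approach, or is there some other way to force the replicated join?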

Thanks
Vijay
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org