Hi, I am facing a huge performance problem when I try to left-outer-join a very big dataset (~140 GB) with a bunch of small lookup tables (star-schema style). I am using DataFrames in Spark SQL. It looks like the data is shuffled and skewed when that join happens. Is there any way to improve the performance of this type of join in Spark?
How can I hint the optimizer to use a replicated join to avoid the shuffle? Would it help to create broadcast variables for the small lookups? If I create broadcast variables, how can I convert them into DataFrames and use them in a Spark SQL join?

Thanks,
Vijay