You could cache the small lookup DataFrames; once Spark SQL knows their (small) size, it will choose a broadcast join for them instead of shuffling the big table.
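To illustrate why that avoids the shuffle: a broadcast (replicated, map-side) join copies the whole small table to every executor and streams the big table against it, so the 140 GB side never moves. The sketch below is plain Python showing the idea only, not Spark's API — in real Spark SQL the optimizer does this automatically for tables under `spark.sql.autoBroadcastJoinThreshold`, and caching the lookups lets it see that they qualify.

```python
def broadcast_left_outer_join(big_rows, small_rows):
    """Conceptual sketch of a replicated left outer join.

    big_rows:   iterable of (key, value) pairs -- the large fact table, streamed.
    small_rows: iterable of (key, lookup_value) pairs -- the small dimension table.

    The small side is materialized as an in-memory dict (the "broadcast" copy
    each executor would hold); the big side is probed row by row, so the big
    table is never shuffled or sorted.
    """
    lookup = dict(small_rows)  # replicated lookup table, held fully in memory
    for key, value in big_rows:
        # left outer semantics: emit the big-side row even with no match
        yield key, value, lookup.get(key)


facts = [(1, "a"), (2, "b"), (3, "c")]   # big side (tiny here, for illustration)
dims = [(1, "x"), (3, "z")]              # small lookup side
joined = list(broadcast_left_outer_join(facts, dims))
```

Explicit broadcast variables are not needed for this: keeping the lookups as ordinary (cached) DataFrames and joining them in SQL lets the planner pick the broadcast strategy itself.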
On 8/14/15, 9:39 AM, "VIJAYAKUMAR JAWAHARLAL" <sparkh...@data2o.io> wrote:

>Hi
>
>I am facing a huge performance problem when I try to left outer join a very
>big data set (~140 GB) with a bunch of small lookup tables [star schema type].
>I am using DataFrames in Spark SQL. It looks like the data is shuffled and
>skewed when that join happens. Is there any way to improve the performance of
>this type of join in Spark?
>
>How can I hint the optimizer to use a replicated join, etc., to avoid the
>shuffle? Would it help to create broadcast variables for the small lookups?
>If I create broadcast variables, how can I convert them into DataFrames and
>use them in a Spark SQL join?
>
>Thanks
>Vijay
>---------------------------------------------------------------------
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>