In Spark 1.4 there is a parameter to control that: spark.sql.autoBroadcastJoinThreshold. Its default value is 10 MB, so any table smaller than that is broadcast automatically. You also need to cache your DataFrame, because caching is what computes the size statistics the optimizer uses to decide whether to broadcast.

On Aug 14, 2015 7:09 PM, "VIJAYAKUMAR JAWAHARLAL" <sparkh...@data2o.io> wrote:
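A minimal sketch of that approach in Scala, assuming a Spark 1.4-style SQLContext; the parquet paths and the column names (dim_id, id) are hypothetical placeholders for your own tables:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

  // Raise the broadcast threshold (in bytes) from the 10 MB default so
  // larger lookup tables still qualify for a broadcast (map-side) join.
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
    (100 * 1024 * 1024).toString)

  // Cache and materialize the small lookup so Spark SQL computes its
  // in-memory size statistics; the optimizer consults those statistics
  // when deciding whether to broadcast instead of shuffle.
  val lookup = sqlContext.read.parquet("/lookups/dim_table")  // hypothetical path
  lookup.cache()
  lookup.count()

  val fact = sqlContext.read.parquet("/facts/big_table")      // hypothetical path
  val joined = fact.join(lookup, fact("dim_id") === lookup("id"), "left_outer")

On Spark 1.5+ you can also force the hint explicitly by wrapping the small side with org.apache.spark.sql.functions.broadcast(lookup) in the join, without touching the threshold.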
> Hi
>
> I am facing a huge performance problem when I am trying to left outer join
> a very big data set (~140GB) with a bunch of small lookups [star schema type].
> I am using data frames in Spark SQL. It looks like data is shuffled and
> skewed when that join happens. Is there any way to improve performance of
> such a join in Spark?
>
> How can I hint the optimizer to go with a replicated join etc., to avoid
> the shuffle? Would it help to create broadcast variables on the small
> lookups? If I create broadcast variables, how can I convert them into data
> frames and use them in a Spark SQL type of join?
>
> Thanks
> Vijay