You could cache the lookup DataFrames; Spark SQL will then do a broadcast join.
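A minimal sketch of hinting the broadcast explicitly (the names bigDF, lookupDF, and the join key "key" are placeholders; the broadcast() function is a real Spark SQL API, added in Spark 1.5):

```scala
// Sketch only: bigDF, lookupDF, and the key column "key" are illustrative names.
import org.apache.spark.sql.functions.broadcast

// Hint the planner to replicate the small lookup table to every executor
// (a map-side / broadcast join), so the ~140GB side is never shuffled.
val joined = bigDF.join(broadcast(lookupDF),
                        bigDF("key") === lookupDF("key"),
                        "left_outer")

// Alternatively, raise the size threshold (in bytes) below which Spark SQL
// broadcasts a table automatically when it knows the table's size.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
                   (50L * 1024 * 1024).toString)
```

Caching the lookups helps because it gives the optimizer size statistics, letting it choose the broadcast join on its own when the table falls under spark.sql.autoBroadcastJoinThreshold; the explicit broadcast() hint avoids relying on that.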



On 8/14/15, 9:39 AM, "VIJAYAKUMAR JAWAHARLAL" <sparkh...@data2o.io> wrote:

>Hi
>
>I am facing a huge performance problem when I try to left-outer-join a very 
>big data set (~140GB) with a bunch of small lookups [star schema type]. I am 
>using DataFrames in Spark SQL. It looks like the data is shuffled and skewed when 
>that join happens. Is there any way to improve the performance of this type of 
>join in Spark? 
>
>How can I hint the optimizer to use a replicated join, etc., to avoid the shuffle? 
>Would it help to create broadcast variables for the small lookups? If I create 
>broadcast variables, how can I convert them into a DataFrame and use them in a 
>Spark SQL type of join?
>
>Thanks
>Vijay
>---------------------------------------------------------------------
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>

