Nope. The count action did not help it choose a broadcast join.
All of my tables are Hive external tables. So, I tried to trigger compute
statistics from sqlContext.sql. It gives me an error saying "no such table". I
am not sure whether that is due to the following bug in 1.4.1
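For reference, a minimal sketch of triggering statistics collection for a Hive table in Spark 1.x (the table name `lkup` is illustrative; this assumes the statement runs through a HiveContext, which is presumably where the "no such table" error comes from):

```python
# Sketch only: Spark 1.x supports the "noscan" variant of ANALYZE TABLE
# for gathering the table size used by the broadcast-join decision.
sqlContext.sql("ANALYZE TABLE lkup COMPUTE STATISTICS noscan")
```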
Try doing a count on both lookups to force the caching to occur before the join.
On 8/17/15, 12:39 PM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote:
Thanks for your help
I tried to cache the lookup tables and left outer join them with the big table (DF).
The join does not seem to be using a broadcast join; it still goes with a hash-partition
join and shuffles the big table. Here is the scenario:
…
table1 as big_df
left outer join
table2 as lkup
on
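The behavior described in this thread can be sketched in plain Python (no Spark; table names and rows are illustrative) to show why a broadcast join avoids shuffling the big side:

```python
# Plain-Python sketch of the two join strategies discussed above.
big = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]   # (key, payload) rows of big_df
lkup = [(1, "x"), (2, "y")]                      # small lookup table

def broadcast_left_outer_join(big_rows, small_rows):
    """Map-side join: the small table is copied ("broadcast") to every
    worker as a hash map; the big table is read in place, never shuffled."""
    small_map = dict(small_rows)
    return [(k, v, small_map.get(k)) for k, v in big_rows]

def shuffle_left_outer_join(big_rows, small_rows):
    """Hash-partition join: rows are first repartitioned by key, which is
    what moves the ~140GB side over the network in the scenario above."""
    num_parts = 2
    parts = [[] for _ in range(num_parts)]
    for k, v in big_rows:                        # the "shuffle": each row moves
        parts[hash(k) % num_parts].append((k, v))  # to the partition owning its key
    small_map = dict(small_rows)
    return [(k, v, small_map.get(k)) for part in parts for k, v in part]
```

Both strategies produce the same rows; the difference is purely which side of the join gets moved across the network.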
Hi
I am facing a huge performance problem when I am trying to left outer join a very
big data set (~140GB) with a bunch of small lookups [star schema type]. I am
using data frames in Spark SQL. It looks like the data is shuffled and skewed when
that join happens. Is there any way to improve performance?
You could cache the lookup DataFrames; it'll then do a broadcast join.
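A minimal sketch of that suggestion (Spark 1.4-era Python API; `sqlContext` and the table name `lkup` are assumed from the scenario, not from this thread):

```python
# Sketch only: needs a running Spark cluster with the table registered.
lkup_df = sqlContext.table("lkup")
lkup_df.cache()   # mark the lookup DataFrame for caching
lkup_df.count()   # an action to materialize the cache before the join,
                  # so the planner knows the table is small enough to broadcast
```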
On 8/14/15, 9:39 AM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote:
In Spark 1.4 there is a parameter to control that. Its default value is 10
MB. So you need to cache your DataFrame to hint at the size.
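The parameter referred to here is presumably `spark.sql.autoBroadcastJoinThreshold` (a size in bytes; 10485760, i.e. ~10 MB, by default in 1.4). A configuration sketch, with an illustrative 50 MB limit:

```python
# Config sketch: raise the auto-broadcast threshold so larger lookup
# tables still qualify for a broadcast join.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```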
On Aug 14, 2015 7:09 PM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io
wrote: