Try this setting in your Spark defaults:

spark.sql.autoBroadcastJoinThreshold=-1

I had a similar problem with joins hanging, and that resolved it for me. Setting 
the threshold to -1 disables automatic broadcast joins, so Spark falls back to a 
regular shuffle join instead of trying to broadcast one side of the join.

You might also be able to pass that value from the driver as a --conf option, but 
I have not tried that myself, so I am not sure whether it will work.
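
Something like this is what I have in mind; just a rough sketch, assuming the 
PySpark 1.x API since you mention pyspark (the script name in the spark-submit 
line is only a placeholder):

# Equivalent spark-submit flag (untested by me):
#   spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 my_job.py
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Set the property on the SparkConf before the context is created ...
conf = SparkConf().set("spark.sql.autoBroadcastJoinThreshold", "-1")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# ... or change it on an already running SQLContext:
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")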

> On Feb 19, 2016, at 11:31 AM, Tamara Mendt <t...@hellofresh.com> wrote:
> 
> Hi all, 
> 
> I am running a Spark job that gets stuck attempting to join two dataframes. 
> The dataframes are not very large: one is about 2 million rows, the other a 
> couple of thousand rows, and the resulting joined dataframe should be about 
> the same size as the smaller one. I have tried triggering execution of the 
> join using the 'first' operator, which as far as I understand would not 
> require processing the entire resulting dataframe (maybe I am mistaken 
> though). The Spark UI is not telling me anything, it just shows the task as 
> stuck.
> 
> When I run the exact same job on a slightly smaller dataset it works without 
> hanging.
> 
> I have used the same environment to run joins on much larger dataframes, so I 
> am confused as to why my Spark job is hanging in this particular case. I have 
> also tried running the same join operation using pyspark on two 2-million-row 
> dataframes (exactly like the one I am trying to join in the job that gets 
> stuck) and it runs successfully.
> 
> I have tried caching the joined dataframe to see how much memory it requires, 
> but the job gets stuck on this action too. I have also tried persisting the 
> join to memory and disk, and the job is stuck all the same.
> 
> Any help as to where to look for the source of the problem would be much 
> appreciated.
> 
> Cheers,
> 
> Tamara
> 
