Try this setting in your Spark defaults: spark.sql.autoBroadcastJoinThreshold=-1
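For reference, this is how it would look in conf/spark-defaults.conf, and (untested on my end) the equivalent spark-submit flag; "your_job.py" is just a placeholder:

    # in conf/spark-defaults.conf
    spark.sql.autoBroadcastJoinThreshold  -1

    # or at submit time
    spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 your_job.py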
I had a similar problem with joins hanging, and that resolved it for me. You might be able to pass that value from the driver as a --conf option, but I have not tried that, so I am not sure it will work.

Sent from my iPad

> On Feb 19, 2016, at 11:31 AM, Tamara Mendt <t...@hellofresh.com> wrote:
>
> Hi all,
>
> I am running a Spark job that gets stuck attempting to join two dataframes.
> The dataframes are not very large: one is about 2 M rows, the other a couple
> of thousand rows, and the resulting joined dataframe should be about the
> same size as the smaller dataframe. I have tried triggering execution of the
> join using the 'first' operator, which as far as I understand should not
> require processing the entire resulting dataframe (maybe I am mistaken,
> though). The Spark UI is not telling me anything, just showing the task as
> stuck.
>
> When I run the exact same job on a slightly smaller dataset it works without
> hanging.
>
> I have used the same environment to run joins on much larger dataframes, so
> I am confused as to why in this particular case my Spark job is just
> hanging. I have also tried running the same join operation using pyspark on
> two 2-million-row dataframes (exactly like the one I am trying to join in
> the job that gets stuck) and it runs successfully.
>
> I have tried caching the joined dataframe to see how much memory it
> requires, but the job gets stuck on this action too. I have also tried using
> persist to memory and disk on the join, and the job seems to be stuck all
> the same.
>
> Any help as to where to look for the source of the problem would be much
> appreciated.
>
> Cheers,
>
> Tamara
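PS: since the original message mentions pyspark, the same property can also be set at runtime on the SQLContext before running the join. A minimal, untested sketch against the Spark 1.x API, using dummy in-memory dataframes (the app name, column names, and sizes are just placeholders):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="join-debug")  # app name is a placeholder
    sqlContext = SQLContext(sc)

    # -1 disables the automatic broadcast join, so Spark falls back to a
    # shuffle-based join instead of broadcasting the smaller side.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

    # Dummy stand-ins for the two dataframes described in the original mail.
    big = sqlContext.createDataFrame(
        [(i, i % 100) for i in range(10000)], ["id", "k"])
    small = sqlContext.createDataFrame(
        [(i, "x") for i in range(100)], ["k", "v"])

    joined = big.join(small, "k")
    print(joined.first())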