Register your DataFrames as temp tables, then run the join against the temp tables. This should resolve your issue.
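A minimal sketch of what I mean, assuming Spark 2.x with a SparkSession. The paths, the app name, and the view names "a" and "b" are placeholders I made up, not anything from your job. On Spark 1.x you would call registerTempTable on each DataFrame and go through sqlContext.sql instead.

```scala
import org.apache.spark.sql.SparkSession

object TempTableJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("temp-table-join") // hypothetical app name
      .getOrCreate()

    // Hypothetical input paths; substitute your own parquet locations.
    val a = spark.read.parquet("/path/to/a") // id: String, Name: String
    val b = spark.read.parquet("/path/to/b") // id: String, Number: Int

    // Register both DataFrames so they can be referenced from SQL.
    a.createOrReplaceTempView("a")
    b.createOrReplaceTempView("b")

    // Same right outer join as a.join(b, Seq("id"), "right_outer"),
    // expressed in SQL against the temp views.
    val joined = spark.sql(
      """SELECT b.id, b.Number, a.Name
        |FROM a RIGHT OUTER JOIN b ON a.id = b.id""".stripMargin)

    // Print the physical plan to see which join strategy was picked.
    joined.explain()

    // Hypothetical output path.
    joined.write.parquet("/path/to/output")

    spark.stop()
  }
}
```

The explain() call is only there so you can see which join strategy is planned before committing to the full run.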
Thanks
Deepak

On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab <as...@live.com> wrote:

> Hello,
> We have two parquet inputs of the following form:
>
> a: id:String, Name:String (1.5TB)
> b: id:String, Number:Int (1.3GB)
>
> We need to join these two to get (id, Number, Name). We've tried two
> approaches:
>
> a.join(b, Seq("id"), "right_outer")
>
> where a and b are dataframes. We also tried taking the rdds, mapping them
> to pair rdds with id as the key, and then joining. What we're seeing is
> that temp file usage is increasing on the join stage, and filling up our
> disks, causing the job to crash. Is there a way to join these two data sets
> without well...crashing?
>
> Note, the ids are unique, and there's a one to one mapping between the two
> datasets.
>
> Any help would be appreciated.
>
> -Ashic.

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net