Hi Deepak,Thanks for the response.
Registering the temp tables didn't help. Here's what I have:
val a = sqlContext..read.parquet(...).select("eid.id",
"name").withColumnRenamed("eid.id", "id")val b =
sqlContext.read.parquet(...).select("id", "number")
a.registerTempTable("a")b.registerTempTable("b")
val results = sqlContext.sql("SELECT x.id, x.name, y.number FROM a x join b y
on x.id=y.id)
results.write.parquet(...)
Is there something I'm missing?
Cheers,Ashic.
From: [email protected]
Date: Tue, 9 Aug 2016 00:01:32 +0530
Subject: Re: Spark join and large temp files
To: [email protected]
CC: [email protected]
Register you dataframes as temp tables and then try the join on the temp
table.This should resolve your issue.
ThanksDeepak
On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab <[email protected]> wrote:
Hello,We have two parquet inputs of the following form:
a: id:String, Name:String (1.5TB)b: id:String, Number:Int (1.3GB)
We need to join these two to get (id, Number, Name). We've tried two approaches:
a.join(b, Seq("id"), "right_outer")
where a and b are dataframes. We also tried taking the rdds, mapping them to
pair rdds with id as the key, and then joining. What we're seeing is that temp
file usage is increasing on the join stage, and filling up our disks, causing
the job to crash. Is there a way to join these two data sets without
well...crashing?
Note, the ids are unique, and there's a one to one mapping between the two
datasets.
Any help would be appreciated.
-Ashic.
--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net