Join requires shuffling. The problem is that you have to shuffle 1.5TB of data, 
which puts pressure on your disk usage. Another option is to broadcast the small 
1.3GB dataset, so there is no shuffle requirement for the 1.5TB dataset. But you 
need to make sure you have enough memory.
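
A minimal sketch of the broadcast route, assuming the dataframes a and b from 
your code below (the broadcast hint lives in org.apache.spark.sql.functions; you 
may also need to raise spark.sql.autoBroadcastJoinThreshold and give the 
executors enough memory to hold the ~1.3GB table):

import org.apache.spark.sql.functions.broadcast

// Broadcast the small side so the 1.5TB side is never shuffled.
val results = a.join(broadcast(b), Seq("id"), "right_outer")
results.write.parquet(...)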


Can you try increasing your partition count? That will make each partition of 
the 1.5TB dataset contain less data, so the disk usage of the shuffle spill 
data per task may be lower.
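
A rough sketch of that, again assuming the dataframe a from your code below; the 
value 4000 is only an illustrative partition count, not a tuned number:

// Raise the shuffle partition count so each task handles a smaller slice of the 1.5TB side.
sqlContext.setConf("spark.sql.shuffle.partitions", "4000")

// Or repartition the large dataframe on the join key before joining.
val aRepartitioned = a.repartition(4000, a("id"))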


But keep in mind you should always have enough disk space to handle the job you 
plan to run.


Yong


________________________________
From: Ashic Mahtab <as...@live.com>
Sent: Monday, August 8, 2016 2:53 PM
To: Deepak Sharma
Cc: Apache Spark
Subject: RE: Spark join and large temp files

Hi Deepak,
Thanks for the response.

Registering the temp tables didn't help. Here's what I have:

val a = sqlContext.read.parquet(...)
  .select("eid.id", "name")
  .withColumnRenamed("eid.id", "id")
val b = sqlContext.read.parquet(...).select("id", "number")

a.registerTempTable("a")
b.registerTempTable("b")

val results = sqlContext.sql(
  "SELECT x.id, x.name, y.number FROM a x JOIN b y ON x.id = y.id")

results.write.parquet(...)

Is there something I'm missing?

Cheers,
Ashic.

________________________________
From: deepakmc...@gmail.com
Date: Tue, 9 Aug 2016 00:01:32 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: user@spark.apache.org

Register your dataframes as temp tables and then try the join on the temp tables.
This should resolve your issue.

Thanks
Deepak

On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab 
<as...@live.com> wrote:
Hello,
We have two parquet inputs of the following form:

a: id:String, Name:String  (1.5TB)
b: id:String, Number:Int  (1.3GB)

We need to join these two to get (id, Number, Name). We've tried two approaches:

a.join(b, Seq("id"), "right_outer")

where a and b are dataframes. We also tried taking the rdds, mapping them to 
pair rdds with id as the key, and then joining. What we're seeing is that temp 
file usage is increasing on the join stage, and filling up our disks, causing 
the job to crash. Is there a way to join these two data sets without 
well...crashing?

Note: the ids are unique, and there's a one-to-one mapping between the two 
datasets.

Any help would be appreciated.

-Ashic.

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
