Have you tried shuffle compression?

spark.shuffle.compress (true|false)

Also, if your filesystem can handle it, I’ve noticed that file consolidation 
helps disk usage a bit.

spark.shuffle.consolidateFiles (true|false)
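
For reference, here is a minimal sketch of how the two settings above could be 
applied in application code. The application name is hypothetical, and this 
assumes you build the SparkConf yourself rather than passing the properties on 
the command line or in spark-defaults.conf. As far as I know, 
spark.shuffle.consolidateFiles only matters for the hash-based shuffle manager, 
so it may have no effect if you are on the sort-based shuffle.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// "shuffle-settings-sketch" is a hypothetical application name.
// The two property names are the ones mentioned above.
val conf = new SparkConf()
  .setAppName("shuffle-settings-sketch")
  .set("spark.shuffle.compress", "true")         // compress map output files
  .set("spark.shuffle.consolidateFiles", "true") // consolidate intermediate shuffle files

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
```

Compression trades some CPU for less shuffle data on disk, so it is worth 
checking the shuffle write size in the UI before and after the change.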

Steve

On Jun 24, 2015, at 3:27 PM, Ulanov, Alexander 
<alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>> wrote:

It also fails, as I mentioned in the original question.

From: CC GP [mailto:chandrika.gopalakris...@gmail.com]
Sent: Wednesday, June 24, 2015 12:08 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: Force inner join to shuffle the smallest table

Try the query below and see if it makes a difference:

val result = sqlContext.sql("select big.f1, big.f2 from small inner join big on big.s=small.s and big.d=small.d")

On Wed, Jun 24, 2015 at 11:35 AM, Ulanov, Alexander 
<alexander.ula...@hp.com<mailto:alexander.ula...@hp.com>> wrote:
Hi,

I am trying to inner join two tables on two fields (a string and a double). One 
table has 2B rows, the other has 500K. Both are stored in HDFS as Parquet. 
Spark v1.4.
val big = sqlContext.parquetFile("hdfs://big")
big.registerTempTable("big")
val small = sqlContext.parquetFile("hdfs://small")
small.registerTempTable("small")
val result = sqlContext.sql("select big.f1, big.f2 from big inner join small on big.s=small.s and big.d=small.d")

This query fails partway through because one of the workers runs out of disk 
space. The reported shuffle size is 1.8TB, which is the maximum size of my 
Spark working dirs (in total across 7 worker nodes). This is surprising, 
because the “big” table takes 2TB of disk space (unreplicated) and “small” 
takes about 5GB, so I would expect the optimizer to shuffle the small table. 
How can I force Spark to shuffle the small table? I tried writing “small inner 
join big”, but it also fails with 1.8TB of shuffle.

Best regards, Alexander
