I am using Spark on a machine with limited disk space, analyzing very large (100GB to 1TB per file) data sets stored in HDFS. When I analyze these datasets, I run group, join, and cogroup operations. All of these trigger shuffles, which write lots of intermediate files to local disk.
Unfortunately, my disk fills up very quickly (I only have 40GB free), and then my process dies because it runs out of disk space. I don't want to write my shuffles to HDFS because it's already pretty full. The shuffle files are cleaned up between runs, but that doesn't help when a single run requires 300GB+ of shuffle disk space. Is there any way to limit the amount of disk space used by my shuffles? I could set up a cron job to delete old shuffle files whilst the job is still running, but I'm concerned that they are left there for a good reason. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Limit-Spark-Shuffle-Disk-Usage-tp23279.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
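In case it helps frame the question, this is the kind of configuration mitigation I've been considering — a sketch only, not my actual config. The property names are from the Spark configuration docs; the /mnt/bigdisk path is just a placeholder for a larger local volume:

```
# spark-defaults.conf (sketch — /mnt/bigdisk is a placeholder path)

# Point shuffle scratch space at the largest local volume(s) available;
# a comma-separated list spreads shuffle files across multiple disks.
spark.local.dir               /mnt/bigdisk/spark-tmp

# Compress shuffle map outputs and spill files (both default to true,
# but worth confirming, since they directly reduce shuffle bytes on disk).
spark.shuffle.compress        true
spark.shuffle.spill.compress  true
```

This only moves or shrinks the shuffle data rather than capping it, though, which is why I'm asking whether a hard limit exists.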