I am using Spark on a machine with limited disk space, analyzing very large (100GB to 1TB per file) data sets stored in HDFS. When I analyze these datasets, I run group, join, and cogroup operations. All of these trigger shuffles, which write lots of intermediate files to local disk.
Unfortunately, my disk fills up very quickly (I only have 40GB free), and then my process dies because it runs out of disk space. I don't want to write my shuffles to HDFS because it's already pretty full. The shuffle files are cleaned up between runs, but that doesn't help when a single run requires 300GB+ of shuffle disk space. Is there any way to limit the amount of disk space used by my shuffles? I could set up a cron job to delete old shuffle files whilst the job is still running, but I'm concerned that they are left there for a good reason. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Limit-Spark-Shuffle-Disk-Usage-tp23279.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
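In case it helps frame the question, this is the kind of configuration mitigation I've been considering — a sketch only, not my actual config. The property names are from the Spark configuration docs; the /mnt/bigdisk path is just a placeholder for a larger local volume:

```
# spark-defaults.conf (sketch — /mnt/bigdisk is a placeholder path)

# Point shuffle scratch space at the largest local volume(s) available;
# a comma-separated list spreads shuffle files across multiple disks.
spark.local.dir               /mnt/bigdisk/spark-tmp

# Compress shuffle map outputs and spill files (both default to true,
# but worth confirming, since they directly reduce shuffle bytes on disk).
spark.shuffle.compress        true
spark.shuffle.spill.compress  true
```

This only moves or shrinks the shuffle data rather than capping it, though, which is why I'm asking whether a hard limit exists.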