You can disable shuffle spill (spark.shuffle.spill
<http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior>)
if you have enough memory to hold that much data. Otherwise, I believe
adding more resources (disk or memory) would be your only choice.
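
For reference, here's a minimal sketch of how those settings could be
passed via SparkConf on Spark 1.x (the app name and the spark.local.dir
path below are placeholders, not something from your setup):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: keep shuffle data in memory instead of spilling to disk.
// Only safe if the executors really have enough memory for the shuffle.
val conf = new SparkConf()
  .setAppName("ShuffleNoSpillExample")        // placeholder app name
  .set("spark.shuffle.spill", "false")        // disable spilling to disk
  // Optionally point shuffle scratch space at a larger volume
  // (path is an assumption; use whatever disk you have available):
  .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")

val sc = new SparkContext(conf)

The same keys can also be passed on the command line via
spark-submit --conf spark.shuffle.spill=false.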

Thanks
Best Regards

On Thu, Jun 11, 2015 at 9:46 PM, Al M <alasdair.mcbr...@gmail.com> wrote:

> I am using Spark on a machine with limited disk space.  I am using it to
> analyze very large (100GB to 1TB per file) data sets stored in HDFS.  When
> I analyze these datasets, I will run groups, joins and cogroups.  All of
> these operations mean lots of shuffle files written to disk.
>
> Unfortunately what happens is my disk fills up very quickly (I only have
> 40GB free).  Then my process dies because I don't have enough space on
> disk.  I don't want to write my shuffles to HDFS because it's already
> pretty full.  The shuffle files are cleared up between runs, but this
> doesn't help when a single run requires 300GB+ shuffle disk space.
>
> Is there any way that I can limit the amount of disk space used by my
> shuffles?  I could set up a cron job to delete old shuffle files whilst the
> job is still running, but I'm concerned that they are left there for a good
> reason.
>
>
>
