Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Hello, It is my understanding that shuffle are written on disk and that they act as checkpoints. I wonder if this is true only within a job, or across jobs. Please note that I use the words job and stage carefully here. 1. can a shuffle created during JobN be used to skip many stages from

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Ah, for #3, maybe this is what *rdd.checkpoint *does! https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD Thomas On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, It is my understanding that shuffle are written on disk and

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
stages in the job UI. They are periodically cleaned up based on available space of the configured spark.local.dirs paths. From: Thomas Gerber Date: Monday, June 29, 2015 at 10:12 PM To: user Subject: Shuffle files lifecycle Hello, It is my understanding that shuffle are written

Re: Shuffle files lifecycle

2015-06-29 Thread Silvio Fiorito
spark.local.dirs paths. From: Thomas Gerber Date: Monday, June 29, 2015 at 10:12 PM To: user Subject: Shuffle files lifecycle Hello, It is my understanding that shuffle are written on disk and that they act as checkpoints. I wonder if this is true only within a job, or across jobs. Please note that I