Re: Understanding and optimizing spark disk usage during a job.

2014-11-29 Thread Vikas Agarwal
I may not be correct (in fact, I may have it completely backwards), but here is my guess: assuming 8 bytes per double, 4000 vectors of dimension 400 for 12k images would require 153.6 GB (12k * 4000 * 400 * 8) of data, which may justify the amount of data being written to disk. Without compression, it seem…
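The size estimate above checks out as a back-of-the-envelope calculation. A minimal sketch, using only the figures stated in the message (12k images, 4000 vectors per image, 400 dimensions, 8 bytes per double):

```python
# Back-of-the-envelope check of the size estimate from the message above.
# All figures are taken from the thread; GB here means decimal gigabytes
# (10^9 bytes), which is how the 153.6 GB figure was computed.
images = 12_000
vectors_per_image = 4_000
dims = 400
bytes_per_double = 8

total_bytes = images * vectors_per_image * dims * bytes_per_double
total_gb = total_bytes / 1e9

print(f"{total_gb:.1f} GB")  # → 153.6 GB
```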

Understanding and optimizing spark disk usage during a job.

2014-11-28 Thread Jaonary Rabarisoa
Dear all, I have a job that crashes before it finishes because of "no space left on device", and I noticed that this job generates a lot of temporary data on my disk. To be precise, the job is a simple map job that takes a set of images, extracts local features, and saves these local features as a seque…

Re: Spark Disk Usage

2014-04-09 Thread Surendranauth Hiraman
Andrew, thanks a lot for the pointer to the code! This has answered my question. It looks like it tries to write to memory first and then, if the data doesn't fit, it spills to disk. I'll have to dig in more to figure out the details. -Suren
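The mechanism described here (buffer records in memory, then spill to disk once a size threshold is exceeded) can be sketched in pure Python. This is a hedged illustration of the general technique only, not Spark's actual implementation; the `SpillableBuffer` class, the threshold, and the pickle-based file format are all invented for the sketch.

```python
import pickle
import tempfile

class SpillableBuffer:
    """Collects items in memory; spills them to a temp file once the
    estimated in-memory size exceeds a threshold (illustrative only)."""

    def __init__(self, memory_limit_bytes=1024):
        self.memory_limit = memory_limit_bytes
        self.in_memory = []
        self.estimated_size = 0
        self.spill_file = None

    def add(self, item):
        self.in_memory.append(item)
        self.estimated_size += len(pickle.dumps(item))
        if self.estimated_size > self.memory_limit:
            self._spill()

    def _spill(self):
        # Append the buffered items to disk, then clear the memory buffer.
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        for item in self.in_memory:
            pickle.dump(item, self.spill_file)
        self.in_memory.clear()
        self.estimated_size = 0

    def items(self):
        # Replay spilled items first (in insertion order), then whatever
        # is still sitting in memory.
        if self.spill_file is not None:
            self.spill_file.seek(0)
            while True:
                try:
                    yield pickle.load(self.spill_file)
                except EOFError:
                    break
        yield from self.in_memory

# With a tiny 64-byte limit, a few adds are enough to force a spill.
buf = SpillableBuffer(memory_limit_bytes=64)
for i in range(10):
    buf.add(("key", i))
print(list(buf.items()))
```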

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
The groupByKey would be aware of the subsequent persist -- that's part of the reason why operations are lazy. As for whether it's materialized in memory first and then flushed to disk vs. streamed to disk, I'm not sure of the exact behavior. What I'd expect to happen is that the RDD is materi…

Re: Spark Disk Usage

2014-04-09 Thread Surendranauth Hiraman
Yes, MEMORY_AND_DISK. We do a groupByKey and then call persist on the resulting RDD. So I'm wondering whether groupByKey is aware of the subsequent persist setting and uses disk, or whether it just creates the Seq[V] in memory and only uses disk after that data structure is fully realized in memory. -Suren

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
Which persistence level are you talking about? MEMORY_AND_DISK?

Re: Spark Disk Usage

2014-04-09 Thread Surendranauth Hiraman
Thanks, Andrew. That helps. For 1, it sounds like the data for the RDD is held in memory and then only written to disk after the entire RDD has been realized in memory. Is that correct? -Suren

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
For 1, persist can be used to save an RDD to disk using the various persistence levels. When a persistence level is set on an RDD, then when that RDD is evaluated it's saved to memory/disk/elsewhere so that it can be re-used. It's applied to that RDD, so that subsequent uses of the RDD can use the cac…
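The persist semantics described here (evaluation is deferred, and the result is cached on first materialization so later uses don't recompute it) can be sketched with a minimal lazy-value wrapper. This is an illustrative analogy in plain Python, not Spark's API; the `LazyCached` class and its names are invented for the sketch.

```python
class LazyCached:
    """A lazily evaluated value that is computed once and re-used,
    mimicking how a persisted RDD is materialized on its first action."""

    def __init__(self, compute):
        self._compute = compute      # the deferred transformation chain
        self._cached = None
        self._evaluated = False
        self.compute_count = 0       # how many times we actually ran

    def get(self):
        if not self._evaluated:      # first action: materialize and cache
            self.compute_count += 1
            self._cached = self._compute()
            self._evaluated = True
        return self._cached          # later actions reuse the cache

data = LazyCached(lambda: [x * x for x in range(5)])
print(data.compute_count)  # → 0 (nothing computed yet: evaluation is lazy)
print(data.get())          # → [0, 1, 4, 9, 16]
print(data.get())          # served from the cache, no recomputation
print(data.compute_count)  # → 1
```

The key point the analogy captures: declaring the computation (like setting a persistence level) does no work by itself; the cache is populated only when the value is first demanded.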

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
It might help if I clarify my questions. :-) 1. Is persist() applied during the transformation right before the persist() call in the graph? Or is it applied after the transform's processing is complete? In the case of things like GroupBy, is the Seq backed by disk as it is being created? We're tr…

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
Hi, any thoughts on this? Thanks. -Suren

Spark Disk Usage

2014-04-03 Thread Surendranauth Hiraman
Hi, I know that if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations that could conceivably create large collections/Sequences, like GroupBy and shuffling. Basically, one part of the question is when is disk…