I may not be correct (in fact I may be completely opposite), but here is my
guess:
Assuming 8 bytes per double, 4000 vectors of dimension 400 for each of 12k
images would require 153.6 GB (12k * 4000 * 400 * 8) of data, which may
explain the amount of data being written to disk. Without compression, it seems
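The arithmetic above can be checked with a quick sketch (the counts are the assumed values from the message, not taken from any actual job config):

```python
# Back-of-the-envelope check: 12k images, 4000 feature vectors per image,
# 400 doubles per vector, 8 bytes per double (assumed values).
images = 12_000
vectors_per_image = 4_000
dims = 400
bytes_per_double = 8

total_bytes = images * vectors_per_image * dims * bytes_per_double
print(total_bytes / 1e9)  # → 153.6 (GB, decimal)
```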
Dear all,
I have a job that crashes before finishing with a "no space left on
device" error, and I noticed that this job generates a lot of temporary data
on my disk.
To be precise, the job is a simple map job that takes a set of images,
extracts local features and saves these local features as a seque
Andrew,
Thanks a lot for the pointer to the code! This has answered my question.
Looks like it tries to write it to memory first and then if it doesn't fit,
it spills to disk. I'll have to dig in more to figure out the details.
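That memory-first, spill-on-overflow pattern can be sketched in pure Python (loosely modeled on Spark's external grouping map; the function name and the size threshold here are invented for illustration, not Spark's actual code):

```python
import os
import pickle
import tempfile
from collections import defaultdict

def group_by_key_with_spill(pairs, max_in_memory=1000):
    """Group (key, value) pairs, spilling the in-memory buffer to a temp
    file whenever it holds max_in_memory values. Illustrative only."""
    current = defaultdict(list)
    count = 0
    spill_files = []

    for key, value in pairs:
        current[key].append(value)
        count += 1
        if count >= max_in_memory:  # buffer full: spill it to disk
            f = tempfile.NamedTemporaryFile(delete=False)
            pickle.dump(dict(current), f)
            f.close()
            spill_files.append(f.name)
            current = defaultdict(list)
            count = 0

    # Merge the remaining in-memory buffer with all spilled maps.
    merged = defaultdict(list)
    for k, vs in current.items():
        merged[k].extend(vs)
    for path in spill_files:
        with open(path, "rb") as f:
            for k, vs in pickle.load(f).items():
                merged[k].extend(vs)
        os.remove(path)
    return dict(merged)
```

For example, `group_by_key_with_spill([("a", 1), ("b", 2), ("a", 3)], max_in_memory=2)` spills once and still returns all three values grouped under their keys.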
-Suren
On Wed, Apr 9, 2014 at 12:46 PM, Andrew Ash wrote:
> The
The groupByKey would be aware of the subsequent persist -- that's part of
the reason why operations are lazy. As far as whether it's materialized in
memory first and then flushed to disk vs. streamed to disk, I'm not sure of
the exact behavior.
What I'd expect to happen would be that the RDD is materi
Yes, MEMORY_AND_DISK.
We do a groupByKey and then call persist on the resulting RDD. So I'm
wondering if groupByKey is aware of the subsequent persist setting to use
disk or just creates the Seq[V] in memory and only uses disk after that
data structure is fully realized in memory.
-Suren
On We
Which persistence level are you talking about? MEMORY_AND_DISK ?
Sent from my mobile phone
On Apr 9, 2014 2:28 PM, "Surendranauth Hiraman"
wrote:
> Thanks, Andrew. That helps.
>
> For 1, it sounds like the data for the RDD is held in memory and then only
> written to disk after the entire RDD ha
Thanks, Andrew. That helps.
For 1, it sounds like the data for the RDD is held in memory and then only
written to disk after the entire RDD has been realized in memory. Is that
correct?
-Suren
On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash wrote:
> For 1, persist can be used to save an RDD to di
For 1, persist can be used to save an RDD to disk using the various
persistence levels. When a persistence level is set on an RDD, then when that
RDD is evaluated it's saved to memory/disk/elsewhere so that it can be
re-used. It's applied to that RDD, so that subsequent uses of the RDD can
use the cac
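That lazy-evaluation-plus-caching behavior can be illustrated with a pure-Python analogue (a toy class invented for this sketch, not Spark's implementation):

```python
class LazyValue:
    """Toy analogue of a lazy, optionally-persisted RDD: the compute
    function runs only when the value is requested, and persist() makes
    the first result get reused instead of recomputed."""
    def __init__(self, compute):
        self._compute = compute
        self._persisted = False
        self._cache = None
        self.evaluations = 0

    def persist(self):
        self._persisted = True
        return self  # like RDD.persist(), returns self for chaining

    def collect(self):
        if self._persisted and self._cache is not None:
            return self._cache
        self.evaluations += 1
        result = self._compute()
        if self._persisted:
            self._cache = result
        return result

# With persist, the value is computed once and reused on later calls.
rdd = LazyValue(lambda: [x * x for x in range(5)]).persist()
rdd.collect()
rdd.collect()
print(rdd.evaluations)  # → 1
```

Without the `persist()` call, each `collect()` would recompute from scratch, which is the behavior persistence is meant to avoid.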
It might help if I clarify my questions. :-)
1. Is persist() applied during the transformation right before the
persist() call in the graph? Or is it applied after the transform's
processing is complete? In the case of things like GroupBy, is the Seq
backed by disk as it is being created? We're tr
Hi,
Any thoughts on this? Thanks.
-Suren
On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman <
suren.hira...@velos.io> wrote:
> Hi,
>
> I know if we call persist with the right options, we can have Spark
> persist an RDD's data on disk.
>
> I am wondering what happens in intermediate operat
Hi,
I know if we call persist with the right options, we can have Spark persist
an RDD's data on disk.
I am wondering what happens in intermediate operations that could
conceivably create large collections/Sequences, like GroupBy and shuffling.
Basically, one part of the question is when is disk