Hi All! I'm getting my feet wet with PySpark for the fairly boring case of doing parameter sweeps for Monte Carlo runs. Each of my functions runs for a very long time (2h+) and returns a NumPy array on the order of ~100 MB. That is, my Spark applications look like:
    def foo(x):
        np.random.seed(x)
        eat_2GB_of_ram()
        take_2h()
        return my_100MB_array

    sc.parallelize(np.arange(100)).map(foo).saveAsPickleFile("s3n://blah...")

The resulting RDDs will most likely not fit in memory, but for this use case I don't really care. I know I can persist RDDs, but is there any way to disk-back them by default (something analogous to mmap?) so that they don't create memory pressure in the system at all? With compute taking this long, the added overhead of disk and network I/O is quite minimal.

Thanks!
...Eric Jonas
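To be concrete about the mmap analogy: outside of Spark I'd do this with numpy.memmap, where the array is backed by a file and the OS pages it in and out, so a "100 MB result" costs essentially no resident memory until it's touched. A minimal sketch (the file path and sizes here are just for illustration):

```python
import os
import tempfile
import numpy as np

# Hypothetical scratch file standing in for one task's result.
path = os.path.join(tempfile.mkdtemp(), "result.dat")

# Writer side: the array lives on disk; the OS flushes dirty pages.
out = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000,))
out[:] = np.arange(1000, dtype=np.float64)
out.flush()
del out  # drop the mapping; the data stays on disk

# Reader side: reopen read-only; nothing is loaded until accessed.
back = np.memmap(path, dtype=np.float64, mode="r", shape=(1000,))
value = float(back[42])
```

This is the behavior I'm hoping for by default on the RDD side: results spill straight to disk rather than occupying executor heap.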