Hi All! I'm getting my feet wet with PySpark for the fairly boring case of doing parameter sweeps for Monte Carlo runs. Each of my functions runs for a very long time (2h+) and returns a NumPy array on the order of ~100 MB. That is, my Spark applications look like:
    def foo(x):
        np.random.seed(x)
        eat_2GB_of_ram()
        take_2h()
        return my_100MB_array

    sc.parallelize(np.arange(100)).map(foo).saveAsPickleFile("s3n://blah...")

The resulting RDDs will most likely not fit in memory, but for this use case I don't really care. I know I can persist RDDs, but is there any way to disk-back them by default (something analogous to mmap?) so that they don't create memory pressure in the system at all? With compute taking this long, the added overhead of disk and network I/O is quite minimal.

Thanks!
...Eric Jonas
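To be concrete about the mmap analogy: outside of Spark I'd do this with numpy.memmap, where the array is backed by a file and the OS pages it in and out, so a "100 MB result" costs essentially no resident memory until it's touched. A minimal sketch (the file path and sizes here are just for illustration):

```python
import os
import tempfile
import numpy as np

# Hypothetical scratch file standing in for one task's result.
path = os.path.join(tempfile.mkdtemp(), "result.dat")

# Writer side: the array lives on disk; the OS flushes dirty pages.
out = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000,))
out[:] = np.arange(1000, dtype=np.float64)
out.flush()
del out  # drop the mapping; the data stays on disk

# Reader side: reopen read-only; nothing is loaded until accessed.
back = np.memmap(path, dtype=np.float64, mode="r", shape=(1000,))
value = float(back[42])
```

This is the behavior I'm hoping for by default on the RDD side: results spill straight to disk rather than occupying executor heap.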