Thanks Shixiong! Your response helped me understand the role of
persist(). Indeed, no persist() calls were required. I solved my problem by
setting spark.local.dir to give Spark's temporary folder more space. It
works automatically. I am seeing logs like this:
Not enough space to cache rd
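For reference, a minimal sketch of the change, assuming the SparkConf API; the directory path below is only a placeholder for a volume with enough free space:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// "/mnt/large-volume/spark-tmp" is a hypothetical path; point
// spark.local.dir at any directory with enough room for shuffle
// and spill files.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.local.dir", "/mnt/large-volume/spark-tmp")
val sc = new SparkContext(conf)
```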
Spark won't store RDDs in memory unless you use a memory-backed
StorageLevel. By default, your input and intermediate results won't be kept
in memory. You can call persist() if you want to avoid duplicate computation
or re-reading the input. E.g.,
val r1 = context.wholeTextFiles(...)
val r2 = r1.flatMap(s => ...)
val
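Filling out that truncated sketch, assuming `context` is an existing SparkContext as above; the path, transformations, and the MEMORY_AND_DISK level are all illustrative choices, not part of the original example:

```scala
import org.apache.spark.storage.StorageLevel

// Placeholder input path and transformation.
val r1 = context.wholeTextFiles("hdfs:///data/input")
val r2 = r1.flatMap { case (_, text) => text.split("\\s+") }

// MEMORY_AND_DISK spills partitions that don't fit in memory to disk,
// so they are read back rather than recomputed from the source.
val r3 = r2.persist(StorageLevel.MEMORY_AND_DISK)

r3.count() // first action materializes and persists r3
r3.count() // second action reuses the persisted data
```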
Is there a way to change the default storage level?
If not, how can I properly change the storage level wherever necessary when
my input and intermediate results do not fit into memory?
In this example:
context.wholeTextFiles(...)
.flatMap(s => ...)
.flatMap(s => ...)
Does persist()