Spark won't store RDDs in memory unless you persist them with a memory-backed
StorageLevel. By default, your input and intermediate results are not kept in
memory. You can call persist on an RDD that is used more than once to avoid
duplicate computation or duplicate reading of the input. E.g.,

val r1 = context.wholeTextFiles(...)
val r2 = r1.flatMap(s => ...)
val r3 = r2.filter(...)...
r3.saveAsTextFile(...)
val r4 = r2.map(...)...
r4.saveAsTextFile(...)

In the above example, r2 is used twice. To speed up the computation, you can
call r2.persist(StorageLevel.MEMORY_ONLY) to keep r2 in memory. persist is
lazy: the first action (r3.saveAsTextFile) computes r2 and fills the cache,
so the second job (r4) reads r2's data directly from memory instead of
recomputing it. E.g.,

val r1 = context.wholeTextFiles(...)
val r2 = r1.flatMap(s => ...)
r2.persist(StorageLevel.MEMORY_ONLY)
val r3 = r2.filter(...)...
r3.saveAsTextFile(...)
val r4 = r2.map(...)...
r4.saveAsTextFile(...)
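
As a side note, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY),
and you can release the memory explicitly once both outputs are written:

r2.cache()       // same as r2.persist(StorageLevel.MEMORY_ONLY)
r2.unpersist()   // call after both saves to release the cached blocks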

See
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
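
For completeness, here is a minimal self-contained sketch of the pattern
above, assuming a word-splitting flatMap and hypothetical HDFS paths (the
paths, the splitting logic, and the local[*] master are made up for
illustration; in a real job the master usually comes from spark-submit):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val context = new SparkContext(
      new SparkConf().setAppName("persist-example").setMaster("local[*]"))

    // wholeTextFiles yields (path, fileContent) pairs
    val r1 = context.wholeTextFiles("hdfs:///tmp/input")
    val r2 = r1.flatMap { case (_, content) => content.split("\\s+") }
    r2.persist(StorageLevel.MEMORY_ONLY)  // mark r2 to be cached when first computed

    val r3 = r2.filter(_.nonEmpty)
    r3.saveAsTextFile("hdfs:///tmp/out-filtered")   // 1st action: computes r2, fills the cache

    val r4 = r2.map(_.toLowerCase)
    r4.saveAsTextFile("hdfs:///tmp/out-lowercase")  // 2nd action: reads r2 from memory

    r2.unpersist()
    context.stop()
  }
}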


Best Regards,
Shixiong Zhu

2015-07-09 22:09 GMT+08:00 Michal Čizmazia <mici...@gmail.com>:

> Is there a way how to change the default storage level?
>
> If not, how can I properly change the storage level wherever necessary, if
> my input and intermediate results do not fit into memory?
>
> In this example:
>
> context.wholeTextFiles(...)
>     .flatMap(s -> ...)
>     .flatMap(s -> ...)
>
> Does persist() need to be called after every transformation?
>
>  context.wholeTextFiles(...)
>     .persist(StorageLevel.MEMORY_AND_DISK)
>     .flatMap(s -> ...)
>     .persist(StorageLevel.MEMORY_AND_DISK)
>     .flatMap(s -> ...)
>     .persist(StorageLevel.MEMORY_AND_DISK)
>
>  Thanks!
