Hi,

I know if we call persist with the right options, we can have Spark persist
an RDD's data on disk.

I am wondering what happens in intermediate operations that could
conceivably create large collections/Sequences, like GroupBy and shuffling.

Basically, one part of the question is: when is disk used internally?

And is calling persist() on the RDD returned by such transformations what
lets it know to use disk in those situations? I am trying to understand
whether persist() is applied during the transformation or after it.
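
To make the question concrete, here is a minimal sketch of the pattern I
mean. It assumes the Scala API; the app name, local master, and input data
are made up for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder app name and local master, for illustration only.
    // (Older Spark releases may also need: import org.apache.spark.SparkContext._)
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Made-up key/value data standing in for a real input.
    val pairs = sc.parallelize(1 to 100000).map(i => (i % 100, i))

    // groupByKey requires a shuffle; this is the intermediate step in question.
    val grouped = pairs.groupByKey()

    // Ask for the grouped RDD to be kept in memory, with partitions that
    // do not fit in memory stored on disk instead.
    grouped.persist(StorageLevel.MEMORY_AND_DISK)

    // An action that forces the grouped RDD to be computed.
    println(grouped.count())

    sc.stop()
  }
}
```

Here persist() is called on the result of groupByKey, so what I cannot tell
is whether that storage level has any effect on the shuffle itself.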

Thank you.


SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io
