So when do we ever need to persist an RDD on disk, given that we don't need to worry about RAM (memory), since virtual memory will just push pages to disk when memory becomes scarce?
On Tue, Aug 23, 2016 11:23 AM, srikanth.je...@gmail.com wrote:

Hi Kant Kodali,

Based on the input parameter to the persist() method, the RDD will either be cached in memory or persisted to disk. In case of failures, Spark will reconstruct the RDD on a different executor based on the DAG. That is how failures are handled. By default, Spark Core does not replicate RDDs, as they can be reconstructed from the source (let's say HDFS, Hive, S3, etc.) but not from memory (which is already lost).

Thanks,
Sreekanth Jella

From: kant kodali
Sent: Tuesday, August 23, 2016 2:12 PM
To: user@spark.apache.org
Subject: Are RDD's ever persisted to disk?

I am new to Spark and I keep hearing that RDDs can be persisted to memory or disk after each checkpoint. I wonder why RDDs are persisted in memory. In case of node failure, how would you access memory to reconstruct the RDD? Persisting to disk makes sense because it is like persisting to a network file system (in the case of HDFS), where each block has multiple copies across nodes, so if a node goes down the RDDs can still be reconstructed by reading the required block from other nodes and recomputing it. But my biggest question is: are RDDs ever persisted to disk?
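
To make the recovery behaviour described in the reply above concrete, here is a small toy model in plain Python. It is only a sketch and does not use Spark: ToyRDD, lose_partition and SOURCE are hypothetical names invented for this illustration. It assumes that a persisted partition lives only on the executor that cached it, and that a lost partition is rebuilt by replaying its lineage from the source data. In real Spark, whether a partition is kept in memory or on disk is chosen by the storage level passed to persist() (for example StorageLevel.MEMORY_ONLY or StorageLevel.DISK_ONLY).

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the lineage needed to rebuild them."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self.compute = compute      # lineage: how to build partition i from its parent
        self.persisted = False
        self.cache = {}             # cached partitions (gone if the executor dies)
        self.compute_calls = 0      # how many times a partition was (re)computed

    def map(self, fn):
        parent = self
        # The child remembers how to derive each partition from its parent.
        return ToyRDD(parent.num_partitions,
                      lambda i: [fn(x) for x in parent.partition(i)])

    def persist(self):
        self.persisted = True
        return self

    def partition(self, i):
        if i in self.cache:
            return self.cache[i]
        self.compute_calls += 1
        data = self.compute(i)
        if self.persisted:
            self.cache[i] = data
        return data

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

    def lose_partition(self, i):
        # Simulates losing the executor that held cached partition i.
        self.cache.pop(i, None)


# Stands in for durable source blocks (e.g. files in HDFS or S3).
SOURCE = [[1, 2], [3, 4], [5, 6]]

source = ToyRDD(len(SOURCE), lambda i: list(SOURCE[i]))
squares = source.map(lambda x: x * x).persist()

first = squares.collect()    # computes and caches all three partitions
squares.lose_partition(1)    # the cached copy of partition 1 is lost
second = squares.collect()   # only partition 1 is recomputed from the source
```

In this sketch the second collect() returns the same result as the first, and only the lost partition is recomputed, which is the point of the reply: the cached copy is an optimization, while the lineage back to the durable source is what makes recovery possible.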