Re: Difference between Checkpointing and Persist

2019-04-19 Thread Gene Pang
Hi Subash, I'm not sure how the checkpointing works, but with StorageLevel.MEMORY_AND_DISK, Spark will store the RDD in on-heap memory, and spill to disk if necessary. However, the data is only usable by that Spark job. Saving the RDD will write the data out to an external storage system, like

Re: Difference between Checkpointing and Persist

2019-04-18 Thread Vadim Semenov
Saving/checkpointing would be preferable in case of a big data set because:
- the RDD gets saved to HDFS and the DAG gets truncated, so if some partitions/executors fail it won't result in recomputing everything
- you don't use memory for caching, therefore the JVM heap is going to be smaller
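The lineage truncation described above can be observed directly. The following is a minimal PySpark sketch, not taken from the thread: the helper name `checkpoint_and_describe` and its arguments are hypothetical, and it assumes `sc` is a live `SparkContext` and `checkpoint_dir` is a path every executor can reach (for example, an HDFS directory).

```python
def checkpoint_and_describe(sc, rdd, checkpoint_dir):
    """Checkpoint `rdd` and return (lineage_before, lineage_after, is_checkpointed).

    Hypothetical helper for illustration only.
    `sc` is assumed to be a pyspark.SparkContext and `rdd` a pyspark.RDD
    on which no job has been run yet.
    """
    # Tell Spark where checkpoint files go (assumed to be reliable storage).
    sc.setCheckpointDir(checkpoint_dir)

    # Lineage description before checkpointing.
    before = rdd.toDebugString()

    # checkpoint() only marks the RDD; nothing is written yet.
    rdd.checkpoint()

    # Running an action triggers the job that writes the checkpoint files.
    rdd.count()

    # After the checkpoint is written, the lineage should start from the
    # checkpointed data rather than from the original parent RDDs.
    after = rdd.toDebugString()
    return before, after, rdd.isCheckpointed()
```

Comparing the two lineage descriptions should show the parent chain replaced by a checkpoint-backed RDD. The Spark documentation also recommends persisting the RDD before checkpointing it, since otherwise writing the checkpoint requires computing the RDD a second time.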

Re: Difference between Checkpointing and Persist

2019-04-18 Thread Jack Kolokasis
Hi, in my view a good approach is to first persist your data with StorageLevel.MEMORY_AND_DISK and then perform the join. This will speed up your computation because the data will be present in memory and on your local intermediate storage device. --Iacovos On 4/18/19 8:49 PM, Subash

Difference between Checkpointing and Persist

2019-04-18 Thread Subash Prabakar
Hi All, I have a question about checkpointing and persisting/saving. Say we have one RDD containing a huge amount of data:
1. We checkpoint and perform a join
2. We persist with StorageLevel.MEMORY_AND_DISK and perform a join
3. We save that intermediate RDD and perform a join (using the same RDD - saving is to just
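For reference, the three options in the question could look roughly like this in PySpark. This is a sketch under assumptions, not code from the thread: the function names are hypothetical, `sc` is assumed to be a `SparkContext`, `big_rdd` and `other_rdd` are assumed to be key-value RDDs, and `saveAsPickleFile` is just one possible way of saving an RDD.

```python
def join_after_checkpoint(sc, big_rdd, other_rdd, checkpoint_dir):
    """Option 1: checkpoint the RDD, then join (hypothetical helper)."""
    sc.setCheckpointDir(checkpoint_dir)  # assumed reliable storage, e.g. HDFS
    big_rdd.checkpoint()                 # marks the RDD for checkpointing
    big_rdd.count()                      # an action triggers the actual write
    return big_rdd.join(other_rdd)


def join_after_persist(big_rdd, other_rdd, storage_level):
    """Option 2: persist the RDD, then join (hypothetical helper).

    Pass pyspark.StorageLevel.MEMORY_AND_DISK as `storage_level` for the
    case in the question. persist() is lazy: the data is cached the first
    time an action computes the RDD.
    """
    big_rdd.persist(storage_level)
    return big_rdd.join(other_rdd)


def join_after_save(sc, big_rdd, other_rdd, path):
    """Option 3: save the RDD to storage, read it back, then join
    (hypothetical helper; pickle files are an assumed format)."""
    big_rdd.saveAsPickleFile(path)       # runs a job and writes the data out
    reloaded = sc.pickleFile(path)       # new RDD whose lineage starts at the files
    return reloaded.join(other_rdd)
```

All three return a lazily evaluated joined RDD; nothing about the join itself runs until an action such as `count()` or `collect()` is called on the result. With option 2 the cached data lives only as long as the application that created it, whereas the files written by options 1 and 3 remain on storage afterwards.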