Saving/checkpointing would be preferable in the case of a big dataset because:

- the RDD gets saved to HDFS and the DAG gets truncated, so if some
partitions or executors fail, it won't result in recomputing everything

- you don't use memory for caching, so the JVM heap stays smaller, which
helps GC, and overall there's more memory left for other operations

- by saving to HDFS you remove potential hotspots, since partitions can be
fetched from many DataNodes, whereas a hot cached partition that gets
requested a lot by other executors can overwhelm the single executor
holding it (see the sketch after this list)
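
To make option 3 (save and read back) concrete, here is a minimal sketch
in Scala. The HDFS paths, field layout, and names are made up for
illustration, not taken from your job:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("save-then-join").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical keyed input for the join.
    val intermediate = sc.textFile("hdfs:///data/events")
      .map(line => (line.split(",")(0), line))

    // Save the intermediate result to HDFS.
    val savePath = "hdfs:///tmp/intermediate-rdd"   // hypothetical path
    intermediate.map { case (k, v) => s"$k\t$v" }.saveAsTextFile(savePath)

    // Read it back: the re-read RDD's DAG starts at the files, so a failed
    // partition is re-read from HDFS rather than recomputed upstream.
    val reloaded = sc.textFile(savePath).map { line =>
      val Array(k, v) = line.split("\t", 2)
      (k, v)
    }

    val other = sc.textFile("hdfs:///data/users")
      .map(line => (line.split(",")(0), line))

    val joined = reloaded.join(other)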

> We save that intermediate RDD and perform join (using same RDD - saving
> is to just persist intermediate result before joining)

Checkpointing is essentially saving the RDD and reading it back; however,
you can't reuse checkpointed data if the job fails, so it'd be nice to have
one side of the join saved explicitly in case of potential issues.
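
For comparison, a rough sketch of options 1 and 2 on the same kind of RDD
(the checkpoint directory is just an example, and sc is assumed from the
sketch above):

    import org.apache.spark.storage.StorageLevel

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical dir

    val intermediate = sc.textFile("hdfs:///data/events")
      .map(line => (line.split(",")(0), line))

    // Option 2: persist keeps blocks in memory, spilling to executor-local
    // disk; lineage is kept, so lost blocks are recomputed from the DAG.
    intermediate.persist(StorageLevel.MEMORY_AND_DISK)

    // Option 1: checkpoint writes the RDD to the checkpoint dir and
    // truncates the lineage. Checkpointing runs its own job over the RDD,
    // so persisting first avoids computing it twice.
    intermediate.checkpoint()

    intermediate.count()   // first action materializes and checkpoints it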

Overall, in my opinion, when working with big joins you should pay more
attention to reliability and fault tolerance than to pure speed, since the
probability of hitting issues grows with the dataset size and the cluster
size.

On Thu, Apr 18, 2019 at 1:49 PM Subash Prabakar <subashpraba...@gmail.com>
wrote:

> Hi All,
>
> I have a doubt about checkpointing and persist/saving.
>
> Say we have one RDD - containing huge data,
> 1. We checkpoint and perform join
> 2. We persist as StorageLevel.MEMORY_AND_DISK and perform join
> 3. We save that intermediate RDD and perform join (using same RDD - saving
> is to just persist intermediate result before joining)
>
>
> Which of the above is faster and what's the difference?
>
>
> Thanks,
> Subash
>


-- 
Sent from my iPhone
