Re: Difference between Checkpointing and Persist

2019-04-19 Thread Gene Pang
Hi Subash,

I'm not sure how the checkpointing works, but with
StorageLevel.MEMORY_AND_DISK, Spark will store the RDD in on-heap memory,
and spill to disk if necessary. However, the data is only usable by that
Spark job. Saving the RDD will write the data out to an external storage
system, like HDFS or Alluxio
.

There are some advantages of saving the RDD, mainly allowing different jobs
or even different frameworks to read that data. One possibility is to save
the RDD to Alluxio, which can store the data in-memory, improving the
throughput by avoiding the disk. Here is an article discussing different
ways to store RDDs

.

Thanks,
Gene

On Thu, Apr 18, 2019 at 10:49 AM Subash Prabakar 
wrote:

> Hi All,
>
> I have a doubt about checkpointing and persist/saving.
>
> Say we have one RDD - containing huge data,
> 1. We checkpoint and perform join
> 2. We persist as StorageLevel.MEMORY_AND_DISK and perform join
> 3. We save that intermediate RDD and perform join (using same RDD - saving
> is to just persist intermediate result before joining)
>
>
> Which of the above is faster and whats the difference?
>
>
> Thanks,
> Subash
>


Re: Difference between Checkpointing and Persist

2019-04-18 Thread Vadim Semenov
saving/checkpointing would be preferable in case of a big data set because:

- the RDD gets saved to HDFS and the DAG gets truncated so if some
partitions/executors fail it won't result in recomputing everything

- you don't use memory for caching therefore the JVM heap is going to be
smaller which helps GC and overall there'll be more memory for other
operations

- by saving to HDFS you're removing potential hotspots since partitions can
be fetched from many DataNodes vs when you get a hot partition that gets
requested a lot by other executors you may end up with an overwhelmed
executor

> We save that intermediate RDD and perform join (using same RDD - saving
is to just persist intermediate result before joining)
Checkpointing is essentially saving the RDD and reading it back, however
you can't read checkpointed data if the job failed so it'd be nice to have
one part of the join saved in case of potential issues.

Overall, in my opinion, when working with big joins you should pay more
attention to reliability and fault-tolerance rather than pure speed as the
probability of having issues grows with increasing the dataset size and
cluster size.

On Thu, Apr 18, 2019 at 1:49 PM Subash Prabakar 
wrote:

> Hi All,
>
> I have a doubt about checkpointing and persist/saving.
>
> Say we have one RDD - containing huge data,
> 1. We checkpoint and perform join
> 2. We persist as StorageLevel.MEMORY_AND_DISK and perform join
> 3. We save that intermediate RDD and perform join (using same RDD - saving
> is to just persist intermediate result before joining)
>
>
> Which of the above is faster and whats the difference?
>
>
> Thanks,
> Subash
>


-- 
Sent from my iPhone


Re: Difference between Checkpointing and Persist

2019-04-18 Thread Jack Kolokasis

Hi,

    in my point of view a good approach is first persist your data in 
StorageLevel.Memory_And_Disk and then perform join. This will accelerate 
your computation because data will be presented in memory and in your 
local intermediate storage device.


--Iacovos

On 4/18/19 8:49 PM, Subash Prabakar wrote:

Hi All,

I have a doubt about checkpointing and persist/saving.

Say we have one RDD - containing huge data,
1. We checkpoint and perform join
2. We persist as StorageLevel.MEMORY_AND_DISK and perform join
3. We save that intermediate RDD and perform join (using same RDD - 
saving is to just persist intermediate result before joining)



Which of the above is faster and whats the difference?


Thanks,
Subash


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Difference between Checkpointing and Persist

2019-04-18 Thread Subash Prabakar
Hi All,

I have a doubt about checkpointing and persist/saving.

Say we have one RDD - containing huge data,
1. We checkpoint and perform join
2. We persist as StorageLevel.MEMORY_AND_DISK and perform join
3. We save that intermediate RDD and perform join (using same RDD - saving
is to just persist intermediate result before joining)


Which of the above is faster and whats the difference?


Thanks,
Subash