Thanks, Vadim. Yes, this is a good option for us. Thanks.
________________________________
From: Vadim Semenov <vadim.seme...@datadoghq.com>
Sent: Wednesday, August 2, 2017 6:24:40 PM
To: Suzen, Mehmet
Cc: jeff saremi; user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
So if you just save an RDD to HDFS via `saveAsSequenceFile`, you would have to create a new RDD that reads that data back; this way you avoid recomputing the RDD, but you may lose time on saving/loading. Exactly the same thing happens in `checkpoint`: `checkpoint` is basically a convenience method that gives you the same RDD back. However, if your job fails, there's no way for a new job to reuse already-checkpointed data from the previous failed run. That's where having a custom checkpointer helps.

Another note: you cannot delete checkpointed data within the same job; you have to delete it some other way.

BTW, have you tried `.persist(StorageLevel.DISK_ONLY)`? It caches data to local disk, freeing up space in the JVM and letting you avoid HDFS.

On Wednesday, August 2, 2017, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:

`saveAsObjectFile` doesn't save the DAG; it acts as a typical action, so it just saves data to some destination.

`cache`/`persist` let you cache data while keeping the DAG, so if an executor that holds data goes down, Spark can still recalculate the missing partitions.

`localCheckpoint` lets you sacrifice fault tolerance and truncate the DAG, so if an executor goes down, the job will fail, because the DAG has already been forgotten.
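The "custom checkpointer" idea above can be sketched like this in Scala. This is a minimal, illustrative sketch, not code from the thread: the paths and the expensive transformation are made up, and it assumes a running SparkContext with HDFS available.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ManualCheckpoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("manual-checkpoint"))

    val expensive = sc.textFile("hdfs:///input/data") // illustrative input path
      .map(_.length)                                  // stand-in for an expensive transformation

    // "Custom checkpoint": materialize the RDD to shared storage...
    val ckptPath = "hdfs:///ckpt/lengths"             // illustrative path
    expensive.saveAsObjectFile(ckptPath)

    // ...then build a *new* RDD from the saved files. Its DAG starts here,
    // so the earlier stages are never recomputed -- and, unlike built-in
    // checkpointing, a later job can reuse the same path if this one fails.
    val restored = sc.objectFile[Int](ckptPath)

    // Alternative mentioned above: keep the DAG but spill blocks to local
    // disk instead of the JVM heap, without touching HDFS.
    expensive.persist(StorageLevel.DISK_ONLY)
  }
}
```

The trade-off is exactly the one described above: saving/loading costs I/O, but the data survives the job, while `DISK_ONLY` persistence is local to the executors and disappears with them.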
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1551-L1610

And `checkpoint` allows you to save data to some shared storage and truncate the DAG, so if an executor goes down, the job will be able to recover the missing partitions from the place where the RDD was saved: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1533-L1549

On Wed, Aug 2, 2017 at 7:20 PM, Suzen, Mehmet <su...@acm.org> wrote:

On 3 August 2017 at 01:05, jeff saremi <jeffsar...@hotmail.com> wrote:
> Vadim:
>
> This is from the Mastering Spark book:
>
> "It is strongly recommended that a checkpointed RDD is persisted in memory,
> otherwise saving it on a file will require recomputation."

Is this really true? I had the impression that the DAG would not be carried along once the RDD is serialized to an external file, so does `saveAsObjectFile` save the DAG as well?
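The `checkpoint` vs `localCheckpoint` distinction, and the book's advice to persist before checkpointing, can be sketched as follows. A minimal sketch, assuming a running SparkContext; the directory and input paths are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo"))
    sc.setCheckpointDir("hdfs:///ckpt")   // shared storage; illustrative path

    val rdd = sc.textFile("hdfs:///input/data").map(_.toUpperCase)

    // Reliable checkpoint: checkpoint() only *marks* the RDD; the data is
    // written to the checkpoint dir when the next action runs, and the DAG
    // is truncated there. Lost partitions can later be re-read from HDFS.
    rdd.persist()     // why the book recommends persisting: without it, the
                      // RDD is computed twice (once for the action, once
                      // again when writing the checkpoint files)
    rdd.checkpoint()
    rdd.count()       // action that actually materializes the checkpoint

    // localCheckpoint: also truncates the DAG, but stores blocks only on
    // the executors. Faster (no HDFS writes), but if an executor dies,
    // those partitions are unrecoverable and the job fails.
    val local = sc.textFile("hdfs:///input/data").map(_.toLowerCase)
    local.localCheckpoint()
    local.count()
  }
}
```

This also answers the quoted question: `saveAsObjectFile` does not save the DAG, it only writes the data, which is why truncating lineage requires either `checkpoint`/`localCheckpoint` or manually rebuilding an RDD from the saved files.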