Thanks, Vadim. Yes, this is a good option for us. Thanks.
________________________________
From: Vadim Semenov <vadim.seme...@datadoghq.com>
Sent: Wednesday, August 2, 2017 6:24:40 PM
To: Suzen, Mehmet
Cc: jeff saremi; user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
So if you just save an RDD to HDFS via `saveAsSequenceFile`, you would have to create a new RDD that reads that data back; this way you avoid recomputing the RDD, but you may lose time on saving/loading. Exactly the same thing happens in `checkpoint`: `checkpoint` is basically a convenience method that gives you the same RDD back. However, if your job fails, there's no way for a new job to reuse already-checkpointed data from the previous failed run. That's where having a custom checkpointer helps.

Another note: you cannot delete checkpointed data within the same job; you have to delete it some other way.

BTW, have you tried `.persist(StorageLevel.DISK_ONLY)`? It caches data to local disk, freeing up space in the JVM and letting you avoid HDFS.

On Wednesday, August 2, 2017, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:

`saveAsObjectFile` doesn't save the DAG; it acts as a typical action, so it just saves data to some destination.

`cache`/`persist` let you cache data while keeping the DAG, so if an executor that holds data goes down, Spark can still recalculate the missing partitions.

`localCheckpoint` lets you sacrifice fault tolerance and truncate the DAG, so if an executor goes down, the job will fail, because the DAG has already been forgotten.
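The "custom checkpointer" idea above can be sketched like this in Scala. This is a minimal, illustrative sketch, not code from the thread: the paths and the expensive transformation are made up, and it assumes a running SparkContext with HDFS available.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ManualCheckpoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("manual-checkpoint"))

    val expensive = sc.textFile("hdfs:///input/data") // illustrative input path
      .map(_.length)                                  // stand-in for an expensive transformation

    // "Custom checkpoint": materialize the RDD to shared storage...
    val ckptPath = "hdfs:///ckpt/lengths"             // illustrative path
    expensive.saveAsObjectFile(ckptPath)

    // ...then build a *new* RDD from the saved files. Its DAG starts here,
    // so the earlier stages are never recomputed -- and, unlike built-in
    // checkpointing, a later job can reuse the same path if this one fails.
    val restored = sc.objectFile[Int](ckptPath)

    // Alternative mentioned above: keep the DAG but spill blocks to local
    // disk instead of the JVM heap, without touching HDFS.
    expensive.persist(StorageLevel.DISK_ONLY)
  }
}
```

The trade-off is exactly the one described above: saving/loading costs I/O, but the data survives the job, while `DISK_ONLY` persistence is local to the executors and disappears with them.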
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1551-L1610

And `checkpoint` allows you to save data to some shared storage and truncate the DAG, so if an executor goes down, the job will be able to recover the missing partitions from the place where the RDD was saved: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1533-L1549

On Wed, Aug 2, 2017 at 7:20 PM, Suzen, Mehmet <su...@acm.org> wrote:

On 3 August 2017 at 01:05, jeff saremi <jeffsar...@hotmail.com> wrote:
> Vadim:
>
> This is from the Mastering Spark book:
>
> "It is strongly recommended that a checkpointed RDD is persisted in memory,
> otherwise saving it on a file will require recomputation."

Is this really true? I had the impression that the DAG would not be carried along once the RDD is serialized to an external file, so does `saveAsObjectFile` save the DAG as well?
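The `checkpoint` vs `localCheckpoint` distinction, and the book's advice to persist before checkpointing, can be sketched as follows. A minimal sketch, assuming a running SparkContext; the directory and input paths are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo"))
    sc.setCheckpointDir("hdfs:///ckpt")   // shared storage; illustrative path

    val rdd = sc.textFile("hdfs:///input/data").map(_.toUpperCase)

    // Reliable checkpoint: checkpoint() only *marks* the RDD; the data is
    // written to the checkpoint dir when the next action runs, and the DAG
    // is truncated there. Lost partitions can later be re-read from HDFS.
    rdd.persist()     // why the book recommends persisting: without it, the
                      // RDD is computed twice (once for the action, once
                      // again when writing the checkpoint files)
    rdd.checkpoint()
    rdd.count()       // action that actually materializes the checkpoint

    // localCheckpoint: also truncates the DAG, but stores blocks only on
    // the executors. Faster (no HDFS writes), but if an executor dies,
    // those partitions are unrecoverable and the job fails.
    val local = sc.textFile("hdfs:///input/data").map(_.toLowerCase)
    local.localCheckpoint()
    local.count()
  }
}
```

This also answers the quoted question: `saveAsObjectFile` does not save the DAG, it only writes the data, which is why truncating lineage requires either `checkpoint`/`localCheckpoint` or manually rebuilding an RDD from the saved files.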