BTW for what it’s worth I agree this is a good option to add, the only tricky thing will be making sure the checkpoint blocks are not garbage-collected by the block store. I don’t think they will be though.
Matei On May 17, 2014, at 2:20 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote: > We do actually have replicated StorageLevels in Spark. You can use > MEMORY_AND_DISK_2 or construct your own StorageLevel with your own custom > replication factor. > > BTW you guys should probably have this discussion on the JIRA rather than the > dev list; I think the replies somehow ended up on the dev list. > > Matei > > On May 17, 2014, at 1:36 AM, Mridul Muralidharan <mri...@gmail.com> wrote: > >> We don't have 3x replication in spark :-) >> And if we use replicated storagelevel, while decreasing odds of failure, it >> does not eliminate it (since we are not doing a great job with replication >> anyway from fault tolerance point of view). >> Also it does take a nontrivial performance hit with replicated levels. >> >> Regards, >> Mridul >> On 17-May-2014 8:16 am, "Xiangrui Meng" <men...@gmail.com> wrote: >> >>> With 3x replication, we should be able to achieve fault tolerance. >>> This checkPointed RDD can be cleared if we have another in-memory >>> checkPointed RDD down the line. It can avoid hitting disk if we have >>> enough memory to use. We need to investigate more to find a good >>> solution. -Xiangrui >>> >>> On Fri, May 16, 2014 at 4:00 PM, Mridul Muralidharan <mri...@gmail.com> >>> wrote: >>>> Effectively this is persist without fault tolerance. >>>> Failure of any node means complete lack of fault tolerance. >>>> I would be very skeptical of truncating lineage if it is not reliable. >>>> On 17-May-2014 3:49 am, "Xiangrui Meng (JIRA)" <j...@apache.org> wrote: >>>> >>>>> Xiangrui Meng created SPARK-1855: >>>>> ------------------------------------ >>>>> >>>>> Summary: Provide memory-and-local-disk RDD checkpointing >>>>> Key: SPARK-1855 >>>>> URL: https://issues.apache.org/jira/browse/SPARK-1855 >>>>> Project: Spark >>>>> Issue Type: New Feature >>>>> Components: MLlib, Spark Core >>>>> Affects Versions: 1.0.0 >>>>> Reporter: Xiangrui Meng >>>>> >>>>> >>>>> Checkpointing is used to cut long lineage while maintaining fault >>>>> tolerance. The current implementation is HDFS-based. Using the BlockRDD >>> we >>>>> can create in-memory-and-local-disk (with replication) checkpoints that >>> are >>>>> not as reliable as HDFS-based solution but faster. >>>>> >>>>> It can help applications that require many iterations. >>>>> >>>>> >>>>> >>>>> -- >>>>> This message was sent by Atlassian JIRA >>>>> (v6.2#6252) >>>>> >>> >