[ https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143484#comment-14143484 ]
Burak Yavuz commented on SPARK-3631:
------------------------------------

Thanks for setting this up [~aash]! [~pwendell], [~tdas], [~joshrosen], could you please confirm/correct/add to my explanation above? Thanks!

> Add docs for checkpoint usage
> -----------------------------
>
>                 Key: SPARK-3631
>                 URL: https://issues.apache.org/jira/browse/SPARK-3631
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Andrew Ash
>
> We should include general documentation on using checkpoints. Right now the
> docs only cover checkpoints in the Spark Streaming use case, which is slightly
> different from Core.
> Some content to consider for inclusion, from [~brkyvz]:
> {quote}
> If you set the checkpointing directory, however, the intermediate state of the
> RDDs will be saved to HDFS, and the lineage will pick up from there.
> You won't need to keep the shuffle data from before the checkpointed state,
> so it can be safely removed (and will be removed automatically).
> However, checkpoint must be called explicitly, as in
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291
> -- just setting the directory is not enough.
> {quote}
> {quote}
> Yes, writing to HDFS is more expensive, but I feel it is still a small price
> to pay compared to hitting a Disk Space Full error three hours in
> and having to start from scratch.
> The main goal of checkpointing is to truncate the lineage. Clearing up
> shuffle writes comes as a bonus to checkpointing; it is not the main goal. The
> subtlety here is that .checkpoint() is just like .cache(): until you call an
> action, nothing happens. Therefore, if you do 1000 maps in a row and don't
> checkpoint in the meantime (say, waiting until a shuffle happens), you will
> still get a StackOverflowError, because the lineage grows too long.
> I went through some of the code for checkpointing. As far as I can tell, it
> materializes the data in HDFS and resets all of the RDD's dependencies, so you
> start a fresh lineage. My understanding is that checkpointing should still be
> done every N operations to reset the lineage; however, an action must be
> performed before the lineage grows too long.
> {quote}
> A good place to put this information would be
> https://spark.apache.org/docs/latest/programming-guide.html
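For the docs, a minimal Scala sketch of the pattern described above might help. This is only an illustration, not code from Spark itself: the checkpoint directory, the iteration count, and the "every N operations" interval are all placeholder values.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))

    // Placeholder path: any reliable storage (typically HDFS) works.
    // Setting the directory alone is not enough; checkpoint() must be called.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    var rdd = sc.parallelize(1 to 1000000)

    // Iterative job: without periodic checkpoints the lineage grows with
    // every map and can eventually cause a StackOverflowError.
    for (i <- 1 to 1000) {
      rdd = rdd.map(_ + 1)
      if (i % 50 == 0) {     // "every N operations"; N = 50 is arbitrary here
        rdd.checkpoint()     // lazy, like cache(): only marks the RDD
        rdd.count()          // an action materializes the checkpoint and
                             // resets the dependencies, truncating the lineage
      }
    }

    println(rdd.count())
    sc.stop()
  }
}
{code}

The key point the example is meant to convey is the pairing of checkpoint() with a subsequent action inside the loop; without the action, nothing is written and the lineage keeps growing.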