[ 
https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143484#comment-14143484
 ] 

Burak Yavuz commented on SPARK-3631:
------------------------------------

Thanks for setting this up [~aash]! [~pwendell], [~tdas], [~joshrosen] could 
you please confirm/correct/add to my explanation above. Thanks!

> Add docs for checkpoint usage
> -----------------------------
>
>                 Key: SPARK-3631
>                 URL: https://issues.apache.org/jira/browse/SPARK-3631
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Andrew Ash
>
> We should include general documentation on using checkpoints.  Right now the 
> docs only cover checkpoints in the Spark Streaming use case which is slightly 
> different from Core.
> Some content to consider for inclusion from [~brkyvz]:
> {quote}
> If you set the checkpointing directory however, the intermediate state of the 
> RDDs will be saved in HDFS, and the lineage will pick off from there.
> You won't need to keep the shuffle data before the checkpointed state, 
> therefore those can be safely removed (will be removed automatically).
> However, checkpoint must be called explicitly as in 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291
>  ,just setting the directory will not be enough.
> {quote}
> {quote}
> Yes, writing to HDFS is more expensive, but I feel it is still a small price 
> to pay when compared to having a Disk Space Full error three hours in
> and having to start from scratch.
> The main goal of checkpointing is to truncate the lineage. Clearing up 
> shuffle writes come as a bonus to checkpointing, it is not the main goal. The
> subtlety here is that .checkpoint() is just like .cache(). Until you call an 
> action, nothing happens. Therefore, if you're going to do 1000 maps in a
> row and you don't want to checkpoint in the meantime until a shuffle happens, 
> you will still get a StackOverflowError, because the lineage is too long.
> I went through some of the code for checkpointing. As far as I can tell, it 
> materializes the data in HDFS, and resets all its dependencies, so you start
> a fresh lineage. My understanding would be that checkpointing still should be 
> done every N operations to reset the lineage. However, an action must be
> performed before the lineage grows too long.
> {quote}
> A good place to put this information would be at 
> https://spark.apache.org/docs/latest/programming-guide.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to