[jira] [Updated] (SPARK-3631) Add docs for checkpoint usage
[ https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-3631:
--------------------------------
    Labels: bulk-closed  (was: )

> Add docs for checkpoint usage
> -----------------------------
>
>                 Key: SPARK-3631
>                 URL: https://issues.apache.org/jira/browse/SPARK-3631
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Andrew Ash
>            Priority: Major
>              Labels: bulk-closed
>
> We should include general documentation on using checkpoints. Right now the
> docs only cover checkpoints in the Spark Streaming use case, which is slightly
> different from Core.
> Some content to consider for inclusion from [~brkyvz]:
> {quote}
> If you set the checkpoint directory, however, the intermediate state of the
> RDDs will be saved in HDFS, and the lineage will pick up from there.
> You won't need to keep the shuffle data from before the checkpointed state,
> so it can be safely removed (and will be removed automatically).
> However, checkpoint must be called explicitly, as in
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291
> ; just setting the directory is not enough.
> {quote}
> {quote}
> Yes, writing to HDFS is more expensive, but I feel it is still a small price
> to pay compared to hitting a Disk Space Full error three hours in
> and having to start from scratch.
> The main goal of checkpointing is to truncate the lineage. Clearing up
> shuffle writes comes as a bonus to checkpointing; it is not the main goal. The
> subtlety here is that .checkpoint() is just like .cache(): until you call an
> action, nothing happens. Therefore, if you're going to do 1000 maps in a
> row and you don't checkpoint somewhere in the meantime, you will still get a
> StackOverflowError, because the lineage is too long.
> I went through some of the code for checkpointing. As far as I can tell, it
> materializes the data in HDFS and resets all of the RDD's dependencies, so you
> start a fresh lineage. My understanding is that checkpointing should still be
> done every N operations to reset the lineage, and an action must be performed
> before the lineage grows too long.
> {quote}
> A good place to put this information would be
> https://spark.apache.org/docs/latest/programming-guide.html

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
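The pattern the quotes describe (set the checkpoint directory once, call `checkpoint()` every N transformations, then run an action so the checkpoint actually materializes) might be sketched as follows. This is a minimal illustration, not text from the ticket: the app name, directory path, and the N=100 interval are arbitrary choices, and it assumes spark-core is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]"))

    // Setting the directory alone does nothing observable; it only tells
    // Spark where checkpoint data should be written (HDFS in production,
    // any reliable store in principle).
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

    var rdd = sc.parallelize(1 to 1000)
    for (i <- 1 to 1000) {
      rdd = rdd.map(_ + 1)
      if (i % 100 == 0) {        // checkpoint every N=100 transformations
        rdd.checkpoint()
        // Like cache(), checkpoint() is lazy: an action must run before the
        // data is written and the lineage is actually truncated. Without
        // this, a long chain of maps can still overflow the stack.
        rdd.count()
      }
    }
    sc.stop()
  }
}
```

After each `count()`, the checkpointed RDD's dependencies are replaced by a read from the checkpoint files, so subsequent failures recompute from there rather than from the original lineage.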
[jira] [Updated] (SPARK-3631) Add docs for checkpoint usage
[ https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-3631:
-----------------------------
    Target Version/s:   (was: 1.2.0)

> Add docs for checkpoint usage
> -----------------------------
>
>                 Key: SPARK-3631
>                 URL: https://issues.apache.org/jira/browse/SPARK-3631
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Andrew Ash

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org