[jira] [Updated] (SPARK-3631) Add docs for checkpoint usage

2019-05-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-3631:

Labels: bulk-closed  (was: )

> Add docs for checkpoint usage
> -
>
> Key: SPARK-3631
> URL: https://issues.apache.org/jira/browse/SPARK-3631
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>Priority: Major
>  Labels: bulk-closed
>
> We should include general documentation on using checkpoints. Right now the
> docs only cover checkpoints in the Spark Streaming use case, which is slightly
> different from Core.
> Some content to consider for inclusion from [~brkyvz]:
> {quote}
> If you set the checkpointing directory, however, the intermediate state of the
> RDDs will be saved in HDFS, and the lineage will pick up from there.
> You won't need to keep the shuffle data from before the checkpointed state,
> so it can be safely removed (and will be removed automatically).
> However, checkpoint must be called explicitly (as in
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291
> ); just setting the directory is not enough.
> {quote}
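> The steps above (set the directory, then call checkpoint explicitly, then run an action) can be sketched as follows. This is an illustrative addition, not part of the original report; it assumes an existing SparkContext {{sc}}, and the HDFS path is a placeholder:
> {code}
> // Set the checkpoint directory once per application.
> sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
>
> val rdd = sc.parallelize(1 to 1000).map(_ * 2)
> rdd.checkpoint() // marks the RDD for checkpointing; nothing is written yet
> rdd.count()      // an action triggers the actual write to HDFS
> {code}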
> {quote}
> Yes, writing to HDFS is more expensive, but I feel it is still a small price 
> to pay when compared to having a Disk Space Full error three hours in
> and having to start from scratch.
> The main goal of checkpointing is to truncate the lineage. Clearing up
> shuffle writes comes as a bonus of checkpointing; it is not the main goal. The
> subtlety here is that .checkpoint() is just like .cache(): until you call an
> action, nothing happens. Therefore, if you're going to do 1000 maps in a
> row and you don't checkpoint in the meantime until a shuffle happens,
> you will still get a StackOverflowError, because the lineage is too long.
> I went through some of the code for checkpointing. As far as I can tell, it
> materializes the data in HDFS and resets all its dependencies, so you start
> a fresh lineage. My understanding is that checkpointing should still be
> done every N operations to reset the lineage, and an action must be
> performed before the lineage grows too long.
> {quote}
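> The "checkpoint every N operations" pattern described above might look like the sketch below. This is an illustrative addition, not from the original discussion; the transformation {{step}} and the interval {{N}} are placeholder names, and {{sc}} is an existing SparkContext:
> {code}
> sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
>
> def step(x: Int): Int = x + 1 // stand-in for a real transformation
> val N = 100                   // checkpoint interval; tune for the workload
>
> var rdd = sc.parallelize(1 to 1000)
> for (i <- 1 to 1000) {
>   rdd = rdd.map(step)
>   if (i % N == 0) {
>     rdd.checkpoint()
>     rdd.count() // force an action so the lineage is actually truncated
>   }
> }
> {code}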
> A good place to put this information would be at 
> https://spark.apache.org/docs/latest/programming-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3631) Add docs for checkpoint usage

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3631:
-
Target Version/s:   (was: 1.2.0)
