[ 
https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697098#comment-14697098
 ] 

Cody Koeninger commented on SPARK-9947:
---------------------------------------

You already have access to offsets and can save them however you want.  You can 
provide those offsets on restart, regardless of whether checkpointing was 
enabled.

> Separate Metadata and State Checkpoint Data
> -------------------------------------------
>
>                 Key: SPARK-9947
>                 URL: https://issues.apache.org/jira/browse/SPARK-9947
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.4.1
>            Reporter: Dan Dutrow
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Problem: When updating an application that has checkpointing enabled to 
> support the updateStateByKey and 24/7 operation functionality, you encounter 
> the problem where you might like to maintain state data between restarts but 
> delete the metadata containing execution state. 
> If checkpoint data exists between code redeployment, the program may not 
> execute properly or at all. My current workaround for this issue is to wrap 
> updateStateByKey with my own function that persists the state after every 
> update to my own separate directory. (That allows me to delete the checkpoint 
> with its metadata before redeploying) Then, when I restart the application, I 
> initialize the state with this persisted data. This incurs additional 
> overhead due to persisting of the same data twice: once in the checkpoint and 
> once in my persisted data folder. 
> If Kafka Direct API offsets could be stored in another separate checkpoint 
> directory, that would help address the problem of having to blow that away 
> between code redeployment as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to