[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cody Koeninger closed SPARK-9947.
---------------------------------
    Resolution: Won't Fix

The direct DStream api already gives access to offsets, and it seems clear that most future work on streaming checkpointing is going to be focused on structured streaming. SPARK-15406

> Separate Metadata and State Checkpoint Data
> -------------------------------------------
>
>                 Key: SPARK-9947
>                 URL: https://issues.apache.org/jira/browse/SPARK-9947
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.4.1
>            Reporter: Dan Dutrow
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Problem: When updating an application that has checkpointing enabled to
> support the updateStateByKey and 24/7 operation functionality, you encounter
> the problem where you might like to maintain state data between restarts but
> delete the metadata containing execution state.
> If checkpoint data exists between code redeployment, the program may not
> execute properly or at all. My current workaround for this issue is to wrap
> updateStateByKey with my own function that persists the state after every
> update to my own separate directory. (That allows me to delete the checkpoint
> with its metadata before redeploying) Then, when I restart the application, I
> initialize the state with this persisted data. This incurs additional
> overhead due to persisting of the same data twice: once in the checkpoint and
> once in my persisted data folder.
> If Kafka Direct API offsets could be stored in another separate checkpoint
> directory, that would help address the problem of having to blow that away
> between code redeployment as well.
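
The workaround described in the quoted issue (persist the state to a separate directory after every update, then seed the state from that directory on restart) can be modelled in a few lines of plain Python. This is only a sketch of the idea, not Spark code and not the reporter's implementation: the names load_state, save_state, persisting_update and running_count, the state.json file name, and the JSON format are all assumptions made for illustration. The update function follows the same shape as an updateStateByKey update function (new values plus optional previous state, returning the new state or None to drop the key).

```python
import json
import os
import tempfile

STATE_FILE = "state.json"  # hypothetical file name, chosen for this sketch


def load_state(state_dir):
    """Return the last persisted state, or an empty dict on a first run."""
    path = os.path.join(state_dir, STATE_FILE)
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_state(state_dir, state):
    """Write to a temp file, then rename, so a crash cannot leave a partial file."""
    os.makedirs(state_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=state_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, os.path.join(state_dir, STATE_FILE))


def persisting_update(update_fn, state_dir):
    """Wrap a per-key update function so each batch also persists the state."""
    def run_batch(state, batch):
        new_state = dict(state)
        for key, values in batch.items():
            updated = update_fn(values, new_state.get(key))
            if updated is None:
                new_state.pop(key, None)  # None means "drop this key"
            else:
                new_state[key] = updated
        save_state(state_dir, new_state)  # the second write the issue mentions
        return new_state
    return run_batch


def running_count(new_values, previous):
    """Example update function: add the new values to the previous total."""
    return sum(new_values) + (previous or 0)
```

A restart then amounts to calling load_state(state_dir) and passing the result as the starting state to the wrapped function, while the checkpoint directory with its metadata can be deleted independently. The extra save_state call on every batch is the duplicated persistence cost the reporter points out.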