This section of the streaming guide makes your options pretty clear:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code


  1.  Run the two versions in parallel, drain the queue up to a point, and start fresh in the new version, only processing events from that point forward.
     *   Note that “up to a point” is specific to your state management logic; it might mean “user sessions started after 4 AM”, NOT “events received after 4 AM”.
  2.  Graceful shutdown and saving the state to a DB, followed by checkpoint cleanup / a new checkpoint dir.
     *   On restart, you need to use the updateStateByKey overload that takes an initialRDD with the values preloaded from the DB (see the sketch after this list).
     *   By cleaning the checkpoint between upgrades, the data is loaded only once.
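For option 2, a rough sketch of the restore path in Scala. loadStateFromDb and the (String, Long) state shape are placeholders for your own schema, not anything from the guide:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("upgrade-restore")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/my-app-v2") // the fresh checkpoint dir

    // Hypothetical helper: reload the (key, count) pairs that the old
    // version saved to the DB during its graceful shutdown.
    def loadStateFromDb(sc: SparkContext): RDD[(String, Long)] =
      sc.parallelize(Seq.empty[(String, Long)]) // stand-in for a real DB read

    val initialRDD = loadStateFromDb(ssc.sparkContext)

    // events stands in for your real source, e.g. a Kafka direct stream.
    val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

    val updateFunc = (values: Seq[Long], state: Option[Long]) =>
      Some(values.sum + state.getOrElse(0L))

    // This overload seeds the state with the rows preloaded from the DB,
    // so nothing is lost across the checkpoint cleanup.
    val state = events.updateStateByKey[Long](
      updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initialRDD)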

Hope this helps,
-adrian

From: Bin Wang
Date: Thursday, September 17, 2015 at 11:27 AM
To: Akhil Das
Cc: user
Subject: Re: How to recovery DStream from checkpoint directory?

In my understanding, I have only three options here to keep the DStream state 
between redeploys (yes, I'm using updateStateByKey):

1. Use checkpoint.
2. Use my own database.
3. Use both.

But none of these options is great:

1. Use checkpoint: I cannot load it after a code change, or I would need to keep the structure of the classes fixed, which seems impossible in a project that is still evolving.
2. Use my own database: there may be a failure between the program reading data from Kafka and saving the DStream state to the database, so there may be data loss.
3. Use both: will the data be loaded twice? How do I know in which situation I should use which one?

The power of checkpointing seems very limited. Is there any plan to support 
recovering from a checkpoint after the classes have changed, as the discussion 
you pointed me to suggests?



Akhil Das <ak...@sigmoidanalytics.com> wrote on Thursday, September 17, 2015 at 3:26 PM:
Any kind of change to the JVM classes will make it fail. By checkpointing the 
data, do you mean using checkpoint with updateStateByKey? Here's a similar 
discussion from earlier that I think will clear up your doubts:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3CCA+AHuK=xoy8dsdaobmgm935goqytaaqkpqsvdaqpmojottj...@mail.gmail.com%3E

Thanks
Best Regards

On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang <wbi...@gmail.com> wrote:
And here is another question: if I load the DStream state from the database 
every time I start the job, will the data also be loaded when the job fails and 
automatically restarts? If so, both the checkpoint data and the database data 
would be loaded; wouldn't that be a problem?



Bin Wang <wbi...@gmail.com> wrote on Wednesday, September 16, 2015 at 8:40 PM:

Will StreamingContext.getOrCreate do this work? What kind of code change will 
make it unable to load?
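For reference, the pattern I mean is roughly this (the app name, batch interval, and checkpoint path are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-app" // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("my-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... define the DStream graph here ...
      ssc
    }

    // First start: createContext() runs. On restart, the graph is
    // deserialized from the checkpoint instead, which is why changed
    // classes make it fail to load.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()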

Akhil Das <ak...@sigmoidanalytics.com> wrote on Wednesday, September 16, 2015 at 20:20:
You can't really recover from a checkpoint if you alter the code. A better 
approach would be to use some sort of external storage (like a DB or ZooKeeper, 
etc.) to keep the state (the indexes, etc.); then, when you deploy new code, 
the state can easily be recovered (rough sketch below).
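Something along these lines, assuming a (String, Long) state and a hypothetical saveToDb writer (replace it with a real client such as JDBC, Cassandra, or ZooKeeper):

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical writer; replace with a real, serializable client.
    def saveToDb(key: String, count: Long): Unit = ()

    // state is the DStream[(String, Long)] produced by updateStateByKey;
    // after each batch, write the updated pairs out so a redeployed job
    // can reload them instead of relying on the checkpoint.
    def persistState(state: DStream[(String, Long)]): Unit =
      state.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // In a real implementation, open one connection per partition.
          partition.foreach { case (key, count) => saveToDb(key, count) }
        }
      }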

Thanks
Best Regards

On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang <wbi...@gmail.com> wrote:
I'd like to know if there is a way to recover a DStream from a checkpoint.

Because I store state in a DStream, I'd like the state to be recovered when I 
restart the application and deploy new code.

