Thanks Adrian, the hint to use updateStateByKey with an initialRDD helps a lot!
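For anyone who finds this thread later: the restart path Adrian describes below (graceful shutdown, persist state to a DB, clean the checkpoint dir, then seed the new job from an initial RDD) can be sketched roughly like this against the Spark 1.x streaming API. `loadStateFromDb` and the `Long` counter state are placeholders for your own persistence logic, not a real implementation:

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Placeholder: load the previously saved (key, state) pairs from your DB.
def loadStateFromDb(sc: SparkContext): RDD[(String, Long)] =
  sc.parallelize(Seq.empty[(String, Long)]) // replace with a real DB read

def attachState(events: DStream[(String, Long)],
                ssc: StreamingContext): DStream[(String, Long)] = {
  val initialRDD = loadStateFromDb(ssc.sparkContext)

  // The overload of updateStateByKey that takes an initial RDD seeds the
  // state from the DB once, on the first batch after the restart.
  events.updateStateByKey(
    (values: Seq[Long], state: Option[Long]) =>
      Some(values.sum + state.getOrElse(0L)),
    new HashPartitioner(ssc.sparkContext.defaultParallelism),
    initialRDD)
}
```

Because the checkpoint dir is wiped between upgrades, the DB read happens only on the upgrade restart; ordinary driver failures still recover from the (fresh) checkpoint.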
Adrian Tanase <atan...@adobe.com> wrote on Thursday, September 17, 2015 at 4:50 PM:

> This section in the streaming guide makes your options pretty clear:
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code
>
> 1. Run the 2 versions in parallel, drain the queue up to a point and start
>    fresh in the new version, only processing events from that point forward.
>    1. Note that "up to a point" is specific to your state management
>       logic; it might mean "user sessions started after 4 am", NOT "events
>       received after 4 am".
> 2. Graceful shutdown and saving the data to a DB, followed by checkpoint
>    cleanup / a new checkpoint dir.
>    1. On restart, you need to use the overload of updateStateByKey that
>       takes an initialRDD with the values preloaded from the DB.
>    2. By cleaning the checkpoint between upgrades, the data is loaded
>       only once.
>
> Hope this helps,
> -adrian
>
> From: Bin Wang
> Date: Thursday, September 17, 2015 at 11:27 AM
> To: Akhil Das
> Cc: user
> Subject: Re: How to recovery DStream from checkpoint directory?
>
> In my understanding, I have only three options to keep the DStream state
> between redeploys (yes, I'm using updateStateByKey):
>
> 1. Use checkpoints.
> 2. Use my own database.
> 3. Use both.
>
> But none of these options is great:
>
> 1. Use checkpoints: I cannot load them after a code change, or I would need
>    to keep the structure of the classes unchanged, which seems impossible in
>    a project under active development.
> 2. Use my own database: the program may fail between reading the data from
>    Kafka and saving the DStream to the database, so data may be lost.
> 3. Use both: will the data be loaded twice? How can I know which one to use
>    in which situation?
>
> The power of checkpoints seems very limited. Is there any plan to support
> checkpoints across class changes, like the discussion you pointed me to
> suggests?
>
> Akhil Das <ak...@sigmoidanalytics.com> wrote on Thursday, September 17, 2015 at 3:26 PM:
>
>> Any kind of change to the JVM classes will make it fail.
>> By checkpointing the data, do you mean using checkpoints with
>> updateStateByKey? Here's a similar discussion from earlier which should
>> clear up your doubts, I guess:
>>
>> http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3CCA+AHuK=xoy8dsdaobmgm935goqytaaqkpqsvdaqpmojottj...@mail.gmail.com%3E
>>
>> Thanks
>> Best Regards
>>
>> On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang <wbi...@gmail.com> wrote:
>>
>>> And here is another question: if I load the DStream from the database
>>> every time I start the job, will that data also be loaded when the job
>>> fails and auto-restarts? If so, both the checkpoint data and the database
>>> data are loaded; won't that be a problem?
>>>
>>> Bin Wang <wbi...@gmail.com> wrote on Wednesday, September 16, 2015 at 8:40 PM:
>>>
>>>> Will StreamingContext.getOrCreate do this work? What kind of code
>>>> change will make it unable to load?
>>>>
>>>> Akhil Das <ak...@sigmoidanalytics.com> wrote on Wednesday, September 16, 2015 at 8:20 PM:
>>>>
>>>>> You can't really recover from a checkpoint if you alter the code. A
>>>>> better approach would be to use some sort of external storage (like a
>>>>> DB or ZooKeeper etc.) to keep the state (the indexes etc.), and then
>>>>> when you deploy new code it can be easily recovered.
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang <wbi...@gmail.com> wrote:
>>>>>
>>>>>> I'd like to know if there is a way to recover a DStream from a
>>>>>> checkpoint.
>>>>>>
>>>>>> Because I store state in the DStream, I'd like the state to be
>>>>>> recovered when I restart the application and deploy new code.
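For reference, the StreamingContext.getOrCreate pattern discussed in this thread looks roughly like the sketch below. It recovers the whole DStream graph and state from the checkpoint directory only when the serialized classes there still deserialize against the running jar, which is why almost any code change breaks recovery. `createContext` and the checkpoint path are illustrative names, assuming the Spark 1.x API:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app" // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... build the DStream graph here (sources, updateStateByKey, sinks) ...
  ssc
}

// If a usable checkpoint exists, the context (and its state) is restored
// from it and createContext() is NOT called; otherwise a fresh context is
// built. After a code change, delete the checkpoint dir and reseed state
// from external storage instead, as discussed above.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```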