Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Akhil Das
Any kind of changes to the jvm classes will make it fail. By checkpointing
the data you mean using checkpoint with updateStateByKey? Here's a similar
discussion happened earlier which will clear your doubts i guess
http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3CCA+AHuK=xoy8dsdaobmgm935goqytaaqkpqsvdaqpmojottj...@mail.gmail.com%3E

Thanks
Best Regards

On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang  wrote:

> And here is another question. If I load the DStream from database every
> time I start the job, will the data be loaded when the job is failed and
> auto restart? If so, both the checkpoint data and database data are loaded,
> won't this a problem?
>
>
>
> Bin Wang 于2015年9月16日周三 下午8:40写道:
>
>> Will StreamingContex.getOrCreate do this work?What kind of code change
>> will make it cannot load?
>>
>> Akhil Das 于2015年9月16日周三 20:20写道:
>>
>>> You can't really recover from checkpoint if you alter the code. A better
>>> approach would be to use some sort of external storage (like a db or
>>> zookeeper etc) to keep the state (the indexes etc) and then when you deploy
>>> new code they can be easily recovered.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang  wrote:
>>>
 I'd like to know if there is a way to recovery dstream from checkpoint.

 Because I stores state in DStream, I'd like the state to be recovered
 when I restart the application and deploy new code.

>>>
>>>


Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
In my understand, here I have only three options to keep the DStream state
between redeploys (yes, I'm using updateStateByKey):

1. Use checkpoint.
2. Use my own database.
3. Use both.

But none of  these options are great:

1. Use checkpoint: I cannot load it after code change. Or I need to keep
the structure of the classes, which seems to be impossible in a developing
project.
2. Use my own database: there may be failure between the program read data
from Kafka and save the DStream to database. So there may have data lose.
3. Use both: Will the data load two times? How can I know in which
situation I should use the which one?

The power of checkpoint seems to be very limited. Is there any plan to
support checkpoint while class is changed, like the discussion you gave me
pointed out?



Akhil Das 于2015年9月17日周四 下午3:26写道:

> Any kind of changes to the jvm classes will make it fail. By checkpointing
> the data you mean using checkpoint with updateStateByKey? Here's a similar
> discussion happened earlier which will clear your doubts i guess
> http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3CCA+AHuK=xoy8dsdaobmgm935goqytaaqkpqsvdaqpmojottj...@mail.gmail.com%3E
>
> Thanks
> Best Regards
>
> On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang  wrote:
>
>> And here is another question. If I load the DStream from database every
>> time I start the job, will the data be loaded when the job is failed and
>> auto restart? If so, both the checkpoint data and database data are loaded,
>> won't this a problem?
>>
>>
>>
>> Bin Wang 于2015年9月16日周三 下午8:40写道:
>>
>>> Will StreamingContex.getOrCreate do this work?What kind of code change
>>> will make it cannot load?
>>>
>>> Akhil Das 于2015年9月16日周三 20:20写道:
>>>
 You can't really recover from checkpoint if you alter the code. A
 better approach would be to use some sort of external storage (like a db or
 zookeeper etc) to keep the state (the indexes etc) and then when you deploy
 new code they can be easily recovered.

 Thanks
 Best Regards

 On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang  wrote:

> I'd like to know if there is a way to recovery dstream from checkpoint.
>
> Because I stores state in DStream, I'd like the state to be recovered
> when I restart the application and deploy new code.
>


>


Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Adrian Tanase
This section in the streaming guide makes your options pretty clear
http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code


  1.  Use 2 versions in parallel, drain the queue up to a point and strat fresh 
in the new version, only processing events from that point forward
 *   Note that “up to a point” is specific to you state management logic, 
it might mean “user sessions stated after 4 am” NOT “events received after 4 am”
  2.  Graceful shutdown and saving data to DB, followed by checkpoint cleanup / 
new checkpoint dir
 *   On restat, you need to use the updateStateByKey that takes an 
initialRdd with the values preloaded from DB
 *   By cleaning the checkpoint in between upgrades, data is loaded only 
once

Hope this helps,
-adrian

From: Bin Wang
Date: Thursday, September 17, 2015 at 11:27 AM
To: Akhil Das
Cc: user
Subject: Re: How to recovery DStream from checkpoint directory?

In my understand, here I have only three options to keep the DStream state 
between redeploys (yes, I'm using updateStateByKey):

1. Use checkpoint.
2. Use my own database.
3. Use both.

But none of  these options are great:

1. Use checkpoint: I cannot load it after code change. Or I need to keep the 
structure of the classes, which seems to be impossible in a developing project.
2. Use my own database: there may be failure between the program read data from 
Kafka and save the DStream to database. So there may have data lose.
3. Use both: Will the data load two times? How can I know in which situation I 
should use the which one?

The power of checkpoint seems to be very limited. Is there any plan to support 
checkpoint while class is changed, like the discussion you gave me pointed out?



Akhil Das 
<ak...@sigmoidanalytics.com<mailto:ak...@sigmoidanalytics.com>>于2015年9月17日周四 
下午3:26写道:
Any kind of changes to the jvm classes will make it fail. By checkpointing the 
data you mean using checkpoint with updateStateByKey? Here's a similar 
discussion happened earlier which will clear your doubts i guess 
http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3CCA+AHuK=xoy8dsdaobmgm935goqytaaqkpqsvdaqpmojottj...@mail.gmail.com%3E

Thanks
Best Regards

On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang 
<wbi...@gmail.com<mailto:wbi...@gmail.com>> wrote:
And here is another question. If I load the DStream from database every time I 
start the job, will the data be loaded when the job is failed and auto restart? 
If so, both the checkpoint data and database data are loaded, won't this a 
problem?



Bin Wang <wbi...@gmail.com<mailto:wbi...@gmail.com>>于2015年9月16日周三 下午8:40写道:

Will StreamingContex.getOrCreate do this work?What kind of code change will 
make it cannot load?

Akhil Das 
<ak...@sigmoidanalytics.com<mailto:ak...@sigmoidanalytics.com>>于2015年9月16日周三 
20:20写道:
You can't really recover from checkpoint if you alter the code. A better 
approach would be to use some sort of external storage (like a db or zookeeper 
etc) to keep the state (the indexes etc) and then when you deploy new code they 
can be easily recovered.

Thanks
Best Regards

On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang 
<wbi...@gmail.com<mailto:wbi...@gmail.com>> wrote:
I'd like to know if there is a way to recovery dstream from checkpoint.

Because I stores state in DStream, I'd like the state to be recovered when I 
restart the application and deploy new code.




Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Bin Wang
Thanks Adrian, the hint of use updateStateByKey with initialRdd helps a lot!

Adrian Tanase <atan...@adobe.com>于2015年9月17日周四 下午4:50写道:

> This section in the streaming guide makes your options pretty clear
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code
>
>
>1. Use 2 versions in parallel, drain the queue up to a point and strat
>fresh in the new version, only processing events from that point forward
>   1. Note that “up to a point” is specific to you state management
>   logic, it might mean “user sessions stated after 4 am” NOT “events 
> received
>   after 4 am”
>2. Graceful shutdown and saving data to DB, followed by checkpoint
>cleanup / new checkpoint dir
>   1. On restat, you need to use the updateStateByKey that takes an
>   initialRdd with the values preloaded from DB
>   2. By cleaning the checkpoint in between upgrades, data is loaded
>   only once
>
> Hope this helps,
> -adrian
>
> From: Bin Wang
> Date: Thursday, September 17, 2015 at 11:27 AM
> To: Akhil Das
> Cc: user
> Subject: Re: How to recovery DStream from checkpoint directory?
>
> In my understand, here I have only three options to keep the DStream state
> between redeploys (yes, I'm using updateStateByKey):
>
> 1. Use checkpoint.
> 2. Use my own database.
> 3. Use both.
>
> But none of  these options are great:
>
> 1. Use checkpoint: I cannot load it after code change. Or I need to keep
> the structure of the classes, which seems to be impossible in a developing
> project.
> 2. Use my own database: there may be failure between the program read data
> from Kafka and save the DStream to database. So there may have data lose.
> 3. Use both: Will the data load two times? How can I know in which
> situation I should use the which one?
>
> The power of checkpoint seems to be very limited. Is there any plan to
> support checkpoint while class is changed, like the discussion you gave me
> pointed out?
>
>
>
> Akhil Das <ak...@sigmoidanalytics.com>于2015年9月17日周四 下午3:26写道:
>
>> Any kind of changes to the jvm classes will make it fail. By
>> checkpointing the data you mean using checkpoint with updateStateByKey?
>> Here's a similar discussion happened earlier which will clear your doubts i
>> guess
>> http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3CCA+AHuK=xoy8dsdaobmgm935goqytaaqkpqsvdaqpmojottj...@mail.gmail.com%3E
>>
>> Thanks
>> Best Regards
>>
>> On Thu, Sep 17, 2015 at 10:01 AM, Bin Wang <wbi...@gmail.com> wrote:
>>
>>> And here is another question. If I load the DStream from database every
>>> time I start the job, will the data be loaded when the job is failed and
>>> auto restart? If so, both the checkpoint data and database data are loaded,
>>> won't this a problem?
>>>
>>>
>>>
>>> Bin Wang <wbi...@gmail.com>于2015年9月16日周三 下午8:40写道:
>>>
>>>> Will StreamingContex.getOrCreate do this work?What kind of code change
>>>> will make it cannot load?
>>>>
>>>> Akhil Das <ak...@sigmoidanalytics.com>于2015年9月16日周三 20:20写道:
>>>>
>>>>> You can't really recover from checkpoint if you alter the code. A
>>>>> better approach would be to use some sort of external storage (like a db 
>>>>> or
>>>>> zookeeper etc) to keep the state (the indexes etc) and then when you 
>>>>> deploy
>>>>> new code they can be easily recovered.
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang <wbi...@gmail.com> wrote:
>>>>>
>>>>>> I'd like to know if there is a way to recovery dstream from
>>>>>> checkpoint.
>>>>>>
>>>>>> Because I stores state in DStream, I'd like the state to be recovered
>>>>>> when I restart the application and deploy new code.
>>>>>>
>>>>>
>>>>>
>>


Re: How to recovery DStream from checkpoint directory?

2015-09-16 Thread Akhil Das
You can't really recover from checkpoint if you alter the code. A better
approach would be to use some sort of external storage (like a db or
zookeeper etc) to keep the state (the indexes etc) and then when you deploy
new code they can be easily recovered.

Thanks
Best Regards

On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang  wrote:

> I'd like to know if there is a way to recovery dstream from checkpoint.
>
> Because I stores state in DStream, I'd like the state to be recovered when
> I restart the application and deploy new code.
>


Re: How to recovery DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
Will StreamingContex.getOrCreate do this work?What kind of code change will
make it cannot load?

Akhil Das 于2015年9月16日周三 20:20写道:

> You can't really recover from checkpoint if you alter the code. A better
> approach would be to use some sort of external storage (like a db or
> zookeeper etc) to keep the state (the indexes etc) and then when you deploy
> new code they can be easily recovered.
>
> Thanks
> Best Regards
>
> On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang  wrote:
>
>> I'd like to know if there is a way to recovery dstream from checkpoint.
>>
>> Because I stores state in DStream, I'd like the state to be recovered
>> when I restart the application and deploy new code.
>>
>
>


How to recovery DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
I'd like to know if there is a way to recovery dstream from checkpoint.

Because I stores state in DStream, I'd like the state to be recovered when
I restart the application and deploy new code.


Re: How to recovery DStream from checkpoint directory?

2015-09-16 Thread Bin Wang
And here is another question. If I load the DStream from database every
time I start the job, will the data be loaded when the job is failed and
auto restart? If so, both the checkpoint data and database data are loaded,
won't this a problem?



Bin Wang 于2015年9月16日周三 下午8:40写道:

> Will StreamingContex.getOrCreate do this work?What kind of code change
> will make it cannot load?
>
> Akhil Das 于2015年9月16日周三 20:20写道:
>
>> You can't really recover from checkpoint if you alter the code. A better
>> approach would be to use some sort of external storage (like a db or
>> zookeeper etc) to keep the state (the indexes etc) and then when you deploy
>> new code they can be easily recovered.
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Sep 16, 2015 at 3:52 PM, Bin Wang  wrote:
>>
>>> I'd like to know if there is a way to recovery dstream from checkpoint.
>>>
>>> Because I stores state in DStream, I'd like the state to be recovered
>>> when I restart the application and deploy new code.
>>>
>>
>>