If you are doing a computation where the result at time T depends on all
the previous data up to T, then Spark Streaming requires you to
periodically checkpoint the RDDs it generates. Checkpointing means saving
the RDD to HDFS (or an HDFS-compatible file system). Say the checkpoint
interval is 1 minute: then if there is a failure, only the RDDs generated
in the last 1 minute may have to be recomputed.
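To see why this bounds recovery cost, here is a toy sketch in plain Python (not the Spark API; all names are illustrative). State accumulates over a stream, a snapshot is taken every `interval` events, and recovery restores the last snapshot and replays only the events after it instead of the whole history:

```python
# Simulate periodic checkpointing of accumulated state.
# A checkpoint is (state snapshot, number of events consumed),
# taken every `interval` events.
def run_with_checkpoints(events, interval):
    state = 0
    checkpoint = (0, 0)
    for i, e in enumerate(events, start=1):
        state += e
        if i % interval == 0:
            checkpoint = (state, i)
    return state, checkpoint

# Recovery: restore the snapshot, then recompute only the tail.
def recover(events, checkpoint):
    state, consumed = checkpoint
    replayed = 0
    for e in events[consumed:]:
        state += e
        replayed += 1
    return state, replayed

events = list(range(1, 101))                  # 100 events total
full_state, ckpt = run_with_checkpoints(events, interval=30)
recovered_state, replayed = recover(events, ckpt)
assert recovered_state == full_state          # same result after a "failure"
assert replayed == 10                         # only events since the checkpoint at 90
```

Without the checkpoint, recovery would replay all 100 events; with it, only the 10 events since the last snapshot.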

TD


On Wed, Feb 26, 2014 at 4:30 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> You can checkpoint & it'll truncate the lineage to only the updates after
> the checkpoint.
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
> On Wed, Feb 26, 2014 at 1:23 PM, Adrian Mocanu
> <amoc...@verticalscope.com> wrote:
>
>>  Hi
>>
>> Scenario: Say I've been streaming tuples with Spark for 24 hours and one
>> of the nodes fails.
>>
>> The RDD will be recomputed on the other Spark nodes and the streaming
>> continues.
>>
>>
>>
>> I'm interested to know how I can skip the first 23 hours and jump in the
>> stream to the last hour. Is this possible?
>>
>>
>>
>> -Adrian
>>
>>
>>
>
>
