Hi,

UpdateStateByKey: if you can briefly describe the issue you are facing with
this, that would be great.

Regarding not keeping the whole dataset in memory, you can tweak the
remember parameter so that checkpointing happens at an appropriate time.
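
For reference, a minimal sketch of the kind of thing I mean (Spark Streaming
Java API; the socket source, checkpoint path, durations and the running-count
state below are placeholders, not your actual use case):

import java.util.List;

import com.google.common.base.Optional;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StatefulSketch {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("stateful-sketch").setMaster("local[2]");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

    // stateful operations such as updateStateByKey need a checkpoint directory
    ssc.checkpoint("/tmp/spark-checkpoint");   // placeholder path
    // remember() bounds how long the RDDs of past micro-batches are kept around
    ssc.remember(Durations.minutes(5));        // placeholder duration

    JavaPairDStream<String, Integer> counts = ssc
        .socketTextStream("localhost", 9999)   // placeholder source
        .mapToPair(line -> new Tuple2<String, Integer>(line, 1));

    // only the per-key state survives between micro-batches, not the whole dataset
    JavaPairDStream<String, Integer> totals = counts.updateStateByKey(
        (List<Integer> newValues, Optional<Integer> state) -> {
          int sum = state.or(0);
          for (Integer v : newValues) {
            sum += v;
          }
          return Optional.of(sum);
        });

    totals.print();
    ssc.start();
    ssc.awaitTermination();
  }
}

The state here is just an Integer, but it can be any serializable object, so
the whole dataset never has to stay in memory.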

Thanks
Twinkle

On Thursday, June 18, 2015, Nipun Arora <nipunarora2...@gmail.com> wrote:

> Hi All,
>
> I am updating my question so that I give more detail. I have also created
> a stackexchange question:
> http://stackoverflow.com/questions/30904244/iterative-programming-on-an-ordered-spark-stream-using-java-in-spark-streaming
>
> Is there any way in Spark Streaming to keep data across multiple
> micro-batches of a sorted DStream, where the stream is sorted using
> timestamps (assuming monotonically arriving data)? Can anyone make
> suggestions on how to keep data across iterations, where each iteration is
> an RDD being processed in a JavaDStream?
>
> *What does iteration mean?*
>
> I first sort the DStream using timestamps, assuming that data arrives
> with monotonically increasing timestamps (no out-of-order data).
>
> I need a global HashMap X, which I would like to update using values
> with timestamp "t1", and then subsequently with "t1+1". Since the state of
> X itself impacts the calculations, this needs to be a sequential operation.
> Hence the operation at "t1+1" depends on HashMap X, which in turn depends
> on data at and before "t1".
>
> *Application*
>
> This is especially the case when one is trying to update a model, compare
> two sets of RDDs, or keep a global history of certain events, etc., which
> will impact operations in future iterations.
>
> I would like to keep some accumulated history to base calculations on: not
> the entire dataset, but certain persisted events which can be used in
> future DStream RDDs.
>
> Thanks
> Nipun
>
> On Wed, Jun 17, 2015 at 11:15 PM, Nipun Arora <nipunarora2...@gmail.com>
> wrote:
>
>> Hi Silvio,
>>
>> Thanks for your response.
>> I should clarify. I would like to do updates on a structure iteratively.
>> I am not sure if updateStateByKey meets my criteria.
>>
>> In the current situation, I can run some map-reduce tasks and generate a
>> JavaPairDStream<Key,Value>. After this, my algorithm is necessarily
>> sequential, i.e. I have sorted the data using the timestamp (within the
>> messages), and I would like to iterate over it and maintain a state in
>> which I can update a model.
>>
>> I tried using foreach/foreachRDD and collect to do this, but I can't
>> seem to propagate values across micro-batches/RDDs.
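>>
>> Stripped down, what I tried looks roughly like the sketch below (Event,
>> sortedPairs and updateModel() are made-up names; sortedPairs is a
>> JavaPairDStream<Long, Event> already sorted by timestamp, and the map is
>> declared outside foreachRDD so that it lives on the driver across
>> micro-batches):
>>
>> // assuming java.util.Map/HashMap and scala.Tuple2 are imported
>> final Map<String, Long> model = new HashMap<>();   // driver-side state
>>
>> sortedPairs.foreachRDD(rdd -> {
>>   // collect() brings this (already sorted) micro-batch to the driver,
>>   // where the shared map can be read and updated in timestamp order
>>   for (Tuple2<Long, Event> t : rdd.collect()) {
>>     updateModel(model, t._2());   // hypothetical sequential update
>>   }
>>   return null;   // foreachRDD takes a Function<..., Void> in the 1.x Java API
>> });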
>>
>> Any suggestions?
>>
>> Thanks
>> Nipun
>>
>>
>>
>> On Wed, Jun 17, 2015 at 10:52 PM, Silvio Fiorito <
>> silvio.fior...@granturing.com> wrote:
>>
>>>   Hi, just answered in your other thread as well...
>>>
>>>  Depending on your requirements, you can look at the updateStateByKey
>>> API
>>>
>>>   From: Nipun Arora
>>> Date: Wednesday, June 17, 2015 at 10:51 PM
>>> To: "user@spark.apache.org
>>> <javascript:_e(%7B%7D,'cvml','user@spark.apache.org');>"
>>> Subject: Iterative Programming by keeping data across micro-batches in
>>> spark-streaming?
>>>
>>>   Hi,
>>>
>>> Is there any way in Spark Streaming to keep data across multiple
>>> micro-batches, for example in a HashMap or something similar?
>>> Can anyone make suggestions on how to keep data across iterations, where
>>> each iteration is an RDD being processed in a JavaDStream?
>>>
>>> This is especially the case when I am trying to update a model, compare
>>> two sets of RDDs, or keep a global history of certain events, etc., which
>>> will impact operations in future iterations.
>>> I would like to keep some accumulated history to base calculations on: not
>>> the entire dataset, but certain persisted events which can be used in
>>> future JavaDStream RDDs.
>>>
>>>  Thanks
>>> Nipun
>>>
>>
>>
>
