Hi All,

I am updating my question to give more detail. I have also created a
Stack Overflow question:
http://stackoverflow.com/questions/30904244/iterative-programming-on-an-ordered-spark-stream-using-java-in-spark-streaming

Is there any way in Spark Streaming to keep data across multiple
micro-batches of a sorted DStream, where the stream is sorted by
timestamp? (Assuming monotonically arriving data.) Can anyone suggest how
to keep data across iterations, where each iteration is an RDD being
processed in a JavaDStream?

*What does iteration mean?*

I first sort the DStream by timestamp, assuming the data arrives with
monotonically increasing timestamps (nothing out of order).

I need a global HashMap X, which I would like to update using the values
with timestamp "t1", and then subsequently with those at "t1+1". Since the
state of X itself impacts the calculation, the updates have to be applied
sequentially: the operation at "t1+1" depends on HashMap X, which in turn
depends on the data at and before "t1".

*Application*

This is especially the case when one is trying to update a model, compare
two sets of RDDs, or keep a global history of certain events, etc., that
will impact operations in future iterations.

I would like to keep some accumulated history for these calculations: not
the entire dataset, but certain events that persist and can be used in
future DStream RDDs.
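
For reference, my understanding of the updateStateByKey suggestion from
the thread below is roughly the following sketch (Spark 1.x Java API with
Guava's Optional; it assumes checkpointing is enabled on the streaming
context, and a running count per key stands in for the real state). My
doubt is that the state here is per key, and within a batch the new values
arrive as a list rather than one timestamp at a time, which is what my
algorithm needs:

import java.util.List;

import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

import com.google.common.base.Optional;

public class UpdateStateSketch {

    // One running value per key, carried across micro-batches by Spark.
    static JavaPairDStream<String, Long> runningCount(JavaPairDStream<String, Long> events) {
        // Requires jssc.checkpoint(...) to be set on the streaming context.
        return events.updateStateByKey(
            new Function2<List<Long>, Optional<Long>, Optional<Long>>() {
                @Override
                public Optional<Long> call(List<Long> newValues, Optional<Long> state) {
                    long sum = state.or(0L);
                    for (Long v : newValues) {
                        sum += v;
                    }
                    return Optional.of(sum);
                }
            });
    }
}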

Thanks
Nipun

On Wed, Jun 17, 2015 at 11:15 PM, Nipun Arora <nipunarora2...@gmail.com>
wrote:

> Hi Silvio,
>
> Thanks for your response.
> I should clarify. I would like to do updates on a structure iteratively. I
> am not sure if updateStateByKey meets my criteria.
>
> In the current situation, I can run some map-reduce tasks and generate a
> JavaPairDStream<Key, Value>. After this my algorithm is necessarily
> sequential, i.e. I have sorted the data by the timestamp (within the
> messages), and I would like to iterate over it and maintain state in
> which I can update a model.
>
> I tried using foreach/foreachRDD with collect to do this, but I can't seem
> to propagate values across micro-batches/RDDs.
>
> Any suggestions?
>
> Thanks
> Nipun
>
>
>
> On Wed, Jun 17, 2015 at 10:52 PM, Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>>   Hi, just answered in your other thread as well...
>>
>>  Depending on your requirements, you can look at the updateStateByKey API
>>
>>   From: Nipun Arora
>> Date: Wednesday, June 17, 2015 at 10:51 PM
>> To: "user@spark.apache.org"
>> Subject: Iterative Programming by keeping data across micro-batches in
>> spark-streaming?
>>
>>   Hi,
>>
>> Is there any way in Spark Streaming to keep data across multiple
>> micro-batches? Like in a HashMap or something?
>> Can anyone suggest how to keep data across iterations, where each
>> iteration is an RDD being processed in a JavaDStream?
>>
>> This is especially the case when I am trying to update a model, compare
>> two sets of RDDs, or keep a global history of certain events, etc., that
>> will impact operations in future iterations.
>> I would like to keep some accumulated history for these calculations: not
>> the entire dataset, but certain events that persist and can be used in
>> future JavaDStream RDDs.
>>
>>  Thanks
>> Nipun
>>
>
>
