Hi Silvio,

Thanks for your response.
I should clarify: I would like to update a data structure iteratively across
batches, and I am not sure whether updateStateByKey meets my requirements.

In the current setup, I run some map-reduce tasks and generate a
JavaPairDStream<Key,Value>. After this point my algorithm is necessarily
sequential: I have sorted the data by the timestamp within the messages, and I
would like to iterate over it in that order while maintaining a state in which
I can update a model.

I tried using foreach/foreachRDD together with collect to do this, but I can't
seem to propagate values across micro-batches/RDDs.
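
Roughly, this is the pattern I have been attempting (a simplified sketch;
sortedStream, Message, and MyModel are placeholders for my actual types):

    import java.util.List;
    import scala.Tuple2;

    // Driver-side state I would like to carry across micro-batches.
    final MyModel model = new MyModel();

    sortedStream.foreachRDD(rdd -> {
        // Bring the (already timestamp-sorted) micro-batch to the driver
        // and apply the sequential model updates in order.
        List<Tuple2<Long, Message>> batch = rdd.collect();
        for (Tuple2<Long, Message> record : batch) {
            model.update(record._2());
        }
        return null; // foreachRDD takes a Function<..., Void> here
    });

Even so, I don't see a clean way to feed the accumulated model back into the
transformations applied to later micro-batches.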

Any suggestions?

Thanks
Nipun



On Wed, Jun 17, 2015 at 10:52 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

>   Hi, just answered in your other thread as well...
>
>  Depending on your requirements, you can look at the updateStateByKey API
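>
>  For example, a minimal (untested) sketch in Java that keeps a running sum
> per key across micro-batches; "jssc" and "pairs" are just example names, and
> in this Spark version the Optional is Guava's:
>
>     import java.util.List;
>     import com.google.common.base.Optional;
>     import org.apache.spark.api.java.function.Function2;
>     import org.apache.spark.streaming.api.java.JavaPairDStream;
>
>     // State update function: merge this batch's values into the running sum.
>     Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunc =
>         (values, state) -> {
>             int sum = state.or(0);
>             for (Integer v : values) {
>                 sum += v;
>             }
>             return Optional.of(sum);
>         };
>
>     // Stateful operations require a checkpoint directory.
>     jssc.checkpoint("checkpoint-dir");
>     JavaPairDStream<String, Integer> runningTotals =
>         pairs.updateStateByKey(updateFunc);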
>
>   From: Nipun Arora
> Date: Wednesday, June 17, 2015 at 10:51 PM
> To: "user@spark.apache.org"
> Subject: Iterative Programming by keeping data across micro-batches in
> spark-streaming?
>
>   Hi,
>
>  Is there any way in Spark Streaming to keep data across multiple
> micro-batches, for example in a HashMap or something similar?
> Can anyone suggest how to keep data across iterations, where each iteration
> is an RDD being processed within a JavaDStream?
>
> This is especially relevant when I am trying to update a model, compare two
> sets of RDDs, or keep a global history of certain events that will affect
> operations in future iterations.
> I would like to keep some accumulated history for my calculations: not the
> entire dataset, but certain persisted events that can be used in future
> JavaDStream RDDs.
>
>  Thanks
> Nipun
>
