[Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
Hi All,

I am updating my question to give more detail. I have also created a
Stack Overflow question:
http://stackoverflow.com/questions/30904244/iterative-programming-on-an-ordered-spark-stream-using-java-in-spark-streaming

Is there any way in Spark Streaming to keep data across multiple
micro-batches of a sorted DStream, where the stream is sorted using
timestamps? (Assume monotonically arriving data.) Can anyone suggest how to
keep data across iterations, where each iteration is an RDD being processed
in a JavaDStream?

*What does iteration mean?*

I first sort the DStream using timestamps, assuming that data arrives with
monotonically increasing timestamps (no out-of-order records).

I need a global HashMap X, which I would like to update using values with
timestamp "t1", and then subsequently with "t1+1". Since the state of X
itself affects the calculations, the updates must be applied sequentially.
Hence the operation at "t1+1" depends on HashMap X, which in turn depends on
the data at and before "t1".

*Application*

This is especially the case when one is trying to update a model, compare
two sets of RDDs, or keep a global history of certain events that will
impact operations in future iterations.

I would like to keep some accumulated history for these calculations: not
the entire dataset, but certain persisted events that can be used in future
DStream RDDs.

Thanks
Nipun

On Wed, Jun 17, 2015 at 11:15 PM, Nipun Arora wrote:

> Hi Silvio,
>
> Thanks for your response.
> I should clarify. I would like to do updates on a structure iteratively. I
> am not sure if updateStateByKey meets my criteria.
>
> In the current situation, I can run some map-reduce tasks and generate a
> JavaPairDStream; after this my algorithm is necessarily sequential, i.e. I
> have sorted the data using the timestamp (within the messages), and I would
> like to iterate over it and maintain a state where I can update a model.
>
> I tried using foreach/foreachRDD and collect to do this, but I can't seem
> to propagate values across micro-batches/RDDs.
>
> Any suggestions?
>
> Thanks
> Nipun
>
>
>
> On Wed, Jun 17, 2015 at 10:52 PM, Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>>   Hi, just answered in your other thread as well...
>>
>>  Depending on your requirements, you can look at the updateStateByKey API
>>
>>   From: Nipun Arora
>> Date: Wednesday, June 17, 2015 at 10:51 PM
>> To: "user@spark.apache.org"
>> Subject: Iterative Programming by keeping data across micro-batches in
>> spark-streaming?
>>
>>   Hi,
>>
>>  Is there any way in Spark Streaming to keep data across multiple
>> micro-batches? Like in a HashMap or something?
>> Can anyone suggest how to keep data across iterations where
>> each iteration is an RDD being processed in a JavaDStream?
>>
>> This is especially the case when I am trying to update a model, compare
>> two sets of RDDs, or keep a global history of certain events that will
>> impact operations in future iterations.
>> I would like to keep some accumulated history for these calculations: not
>> the entire dataset, but certain persisted events that can be used in future
>> JavaDStream RDDs.
>>
>>  Thanks
>> Nipun
>>
>
>


Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread twinkle sachdeva
Hi,

UpdateStateByKey: if you can briefly describe the issue you are facing with
it, that would be great.
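
For reference, a rough sketch of what updateStateByKey looks like with the
Java API (Spark 1.x style, using Guava's Optional); the Integer key and value
types below are only placeholders:

import java.util.List;

import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

import com.google.common.base.Optional;

public class RunningCount {

    // For each key, carry a running sum across micro-batches.
    public static JavaPairDStream<Integer, Integer> withState(
            JavaPairDStream<Integer, Integer> counts) {
        return counts.updateStateByKey(
            new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
                @Override
                public Optional<Integer> call(List<Integer> newValues,
                                              Optional<Integer> state) {
                    int sum = state.or(0);         // previous state for this key, if any
                    for (Integer v : newValues) {  // values seen in this micro-batch
                        sum += v;
                    }
                    return Optional.of(sum);       // state carried to the next batch
                }
            });
    }
}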

Regarding not keeping the whole dataset in memory, you can tweak the
remember parameter so that checkpointing happens at an appropriate time.
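
And a hedged example of the related context settings (the checkpoint
directory and the durations below are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ContextSetup {

    public static JavaStreamingContext create() {
        SparkConf conf = new SparkConf().setAppName("iterative-stream");
        // 10-second micro-batches (placeholder value).
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        // Checkpointing is required for stateful operations such as updateStateByKey.
        jssc.checkpoint("hdfs:///tmp/streaming-checkpoint");
        // Keep each batch's RDDs around for 5 minutes rather than the whole dataset.
        jssc.remember(Durations.minutes(5));
        return jssc;
    }
}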

Thanks
Twinkle



Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
Hi All,

I appreciate the help :)

Here is sample code where I am trying to keep the data of the previous RDD
and the current RDD inside a foreachRDD in Spark Streaming. I do not know
whether the code below technically works, as I cannot compile it, but the
intent is to keep a historical reference to the last RDD in this scenario.
This is the furthest I have got.

You can imagine another scenario where I keep a historical list, so that if
I see a certain "order" of events, I store them.

// Imports assumed: java.util.Date, java.util.List, scala.Tuple2,
// org.apache.spark.api.java.JavaPairRDD, org.apache.spark.api.java.function.Function

// Register the per-batch function (this call is inside the static main()).
// ABC must be a static nested class so it can be instantiated here without an
// enclosing Type4ViolationChecker instance; that resolves the original
// "cannot be referenced from a static context" error.
sortedtsStream.foreachRDD(new ABC());

static class ABC implements Function<JavaPairRDD<Tuple2<Long, Integer>, Integer>, Void> {

    // The previous and current micro-batch, kept on the driver across calls.
    private static List<Tuple2<Tuple2<Long, Integer>, Integer>> prevlist = null;
    private static List<Tuple2<Tuple2<Long, Integer>, Integer>> currentlist = null;

    @Override
    public Void call(JavaPairRDD<Tuple2<Long, Integer>, Integer> rdd) throws Exception {
        // Bring the batch to the driver so its contents can be kept for the next batch.
        List<Tuple2<Tuple2<Long, Integer>, Integer>> list = rdd.collect();

        if (prevlist != null && currentlist != null) {
            prevlist = currentlist;
            currentlist = list;
        } else {
            currentlist = list;
            prevlist = list;
        }

        System.out.println("Printing previous");
        for (Tuple2<Tuple2<Long, Integer>, Integer> tuple : prevlist) {
            Date date = new Date(tuple._1()._1());
            int pattern = tuple._1()._2();
            int count = tuple._2();
            System.out.println("TimeSlot: " + date.toString()
                    + " Pattern: " + pattern + " Count: " + count);
        }

        System.out.println("Printing current");
        for (Tuple2<Tuple2<Long, Integer>, Integer> tuple : currentlist) {
            Date date = new Date(tuple._1()._1());
            int pattern = tuple._1()._2();
            int count = tuple._2();
            System.out.println("TimeSlot: " + date.toString()
                    + " Pattern: " + pattern + " Count: " + count);
        }

        return null;
    }
}
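
A possible variant, sketched under the same type assumptions (also not
compiled): keep a reference to the previous batch's RDD on the driver and
compare it to the current batch with a join, instead of collecting both lists:

// Previous micro-batch, held as an RDD reference on the driver.
static JavaPairRDD<Tuple2<Long, Integer>, Integer> prevRDD = null;

static void compareWithPrevious(JavaPairRDD<Tuple2<Long, Integer>, Integer> currentRDD) {
    if (prevRDD != null) {
        // Keys present in both the previous and the current batch,
        // with the count from each side.
        JavaPairRDD<Tuple2<Long, Integer>, Tuple2<Integer, Integer>> joined =
                prevRDD.join(currentRDD);
        System.out.println("Keys seen in both batches: " + joined.count());
    }
    // Cache the current batch so it can be reused when the next batch arrives;
    // the streaming context's remember duration must be long enough that the
    // old batch's RDD is not cleaned up in the meantime.
    prevRDD = currentRDD.cache();
}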


Thanks
Nipun


Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
@Twinkle - what did you mean by "Regarding not keeping the whole dataset in
memory, you can tweak the remember parameter so that checkpointing happens
at an appropriate time"?
