Re: Maintaining overall cumulative data in Spark Streaming

2015-10-30 Thread Silvio Fiorito
In the update function you can return None for a key and it will remove it. If 
you’re restarting your app you can delete your checkpoint directory to start 
from scratch, rather than continuing from the previous state.
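For illustration, a minimal sketch of such an update function in Scala (the names wordDstream and updateFunc are illustrative, not from this thread; wordDstream is assumed to be a DStream[(String, Int)] of (word, 1) pairs):

    // Minimal sketch, assuming wordDstream: DStream[(String, Int)]
    val updateFunc = (newValues: Seq[Int], state: Option[Int]) => {
      val newCount = state.getOrElse(0) + newValues.sum
      if (newCount == 0) None   // returning None removes the key from the state
      else Some(newCount)       // otherwise carry the running total forward
    }
    val cumulativeCounts = wordDstream.updateStateByKey[Int](updateFunc)

Returning None is how a key's state is dropped; to "reset to null" you would return None for the keys you want cleared, or delete the checkpoint directory to wipe everything on restart.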

From: Sandeep Giri <sand...@knowbigdata.com>
Date: Friday, October 30, 2015 at 9:29 AM
To: skaarthik oss <skaarthik@gmail.com>
Cc: dev <d...@spark.apache.org>, user <user@spark.apache.org>
Subject: Re: Maintaining overall cumulative data in Spark Streaming

How do we reset the aggregated statistics to null?

Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com
Phone: +1-253-397-1945 (Office)



On Fri, Oct 30, 2015 at 9:49 AM, Sandeep Giri <sand...@knowbigdata.com> wrote:

Yes, update state by key worked.

Though there are some more complications.

On Oct 30, 2015 8:27 AM, "skaarthik oss" <skaarthik@gmail.com> wrote:
Did you consider UpdateStateByKey operation?

From: Sandeep Giri [mailto:sand...@knowbigdata.com]
Sent: Thursday, October 29, 2015 3:09 PM
To: user <user@spark.apache.org>; dev <d...@spark.apache.org>
Subject: Maintaining overall cumulative data in Spark Streaming

Dear All,

If a continuous stream of text is coming in and you have to keep publishing the 
overall word count so far since 0:00 today, what would you do?

Publishing the results for a window is easy, but if we have to keep aggregating 
the results, how do we go about it?

I have tried keeping a StreamRDD with the aggregated count and repeatedly doing a 
fullOuterJoin, but it didn't work. It seems the StreamRDD gets reset.

Kindly help.

Regards,
Sandeep Giri




Re: Maintaining overall cumulative data in Spark Streaming

2015-10-30 Thread Sandeep Giri
How do we reset the aggregated statistics to null?

Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)

www.KnowBigData.com. 
Phone: +1-253-397-1945 (Office)



On Fri, Oct 30, 2015 at 9:49 AM, Sandeep Giri 
wrote:

> Yes, update state by key worked.
>
> Though there are some more complications.
> On Oct 30, 2015 8:27 AM, "skaarthik oss"  wrote:
>
>> Did you consider UpdateStateByKey operation?
>>
>>
>>
>> *From:* Sandeep Giri [mailto:sand...@knowbigdata.com]
>> *Sent:* Thursday, October 29, 2015 3:09 PM
>> *To:* user ; dev 
>> *Subject:* Maintaining overall cumulative data in Spark Streaming
>>
>>
>>
>> Dear All,
>>
>>
>>
>> If a continuous stream of text is coming in and you have to keep
>> publishing the overall word count so far since 0:00 today, what would you
>> do?
>>
>>
>>
>> Publishing the results for a window is easy, but if we have to keep
>> aggregating the results, how do we go about it?
>>
>>
>>
>> I have tried keeping a StreamRDD with the aggregated count and repeatedly doing a
>> fullOuterJoin, but it didn't work. It seems the StreamRDD gets reset.
>>
>>
>>
>> Kindly help.
>>
>>
>>
>> Regards,
>>
>> Sandeep Giri
>>
>>
>>
>


RE: Maintaining overall cumulative data in Spark Streaming

2015-10-29 Thread Silvio Fiorito
You could use updateStateByKey. There's a stateful word count example on Github.

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
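Condensed, the updateStateByKey pattern in that example looks roughly like the sketch below (host, port, and batch interval are placeholders, not part of the original message):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulNetworkWordCountSketch {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
        val ssc = new StreamingContext(sparkConf, Seconds(1))
        ssc.checkpoint(".")   // updateStateByKey requires a checkpoint directory

        // Running total per word: this batch's values plus the previous state.
        val updateFunc = (values: Seq[Int], state: Option[Int]) =>
          Some(values.sum + state.getOrElse(0))

        val lines = ssc.socketTextStream("localhost", 9999)
        val wordDstream = lines.flatMap(_.split(" ")).map(w => (w, 1))
        val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)

        stateDstream.print()   // cumulative counts since the stream started
        ssc.start()
        ssc.awaitTermination()
      }
    }

The state survives across batches (and, with checkpointing, across restarts), which is what makes it suitable for a running "count since 0:00 today" style aggregate.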

From: Sandeep Giri
Sent: 10/29/2015 6:08 PM
To: user; dev
Subject: Maintaining overall cumulative data in Spark Streaming

Dear All,

If a continuous stream of text is coming in and you have to keep publishing the 
overall word count so far since 0:00 today, what would you do?

Publishing the results for a window is easy, but if we have to keep aggregating 
the results, how do we go about it?

I have tried keeping a StreamRDD with the aggregated count and repeatedly doing a 
fullOuterJoin, but it didn't work. It seems the StreamRDD gets reset.

Kindly help.

Regards,
Sandeep Giri



RE: Maintaining overall cumulative data in Spark Streaming

2015-10-29 Thread Sandeep Giri
Yes, update state by key worked.

Though there are some more complications.
On Oct 30, 2015 8:27 AM, "skaarthik oss"  wrote:

> Did you consider UpdateStateByKey operation?
>
>
>
> *From:* Sandeep Giri [mailto:sand...@knowbigdata.com]
> *Sent:* Thursday, October 29, 2015 3:09 PM
> *To:* user ; dev 
> *Subject:* Maintaining overall cumulative data in Spark Streaming
>
>
>
> Dear All,
>
>
>
> If a continuous stream of text is coming in and you have to keep
> publishing the overall word count so far since 0:00 today, what would you
> do?
>
>
>
> Publishing the results for a window is easy, but if we have to keep
> aggregating the results, how do we go about it?
>
>
>
> I have tried keeping a StreamRDD with the aggregated count and repeatedly doing a
> fullOuterJoin, but it didn't work. It seems the StreamRDD gets reset.
>
>
>
> Kindly help.
>
>
>
> Regards,
>
> Sandeep Giri
>
>
>