Re: Calculating Timeseries Aggregation

2015-11-19 Thread Sanket Patil
Hey Sandip:

TD has already outlined the right approach, but let me add a couple of
thoughts, as I recently worked on a similar project. I had to compute some
real-time metrics on streaming data, and these metrics had to be aggregated at
hour/day/week/month granularity. My data pipeline was Kafka --> Spark
Streaming --> Cassandra.

I had a Spark Streaming job that did the following: (1) receive a window of
raw streaming data and write it to Cassandra, and (2) do only the basic
computations that need to be shown on a real-time dashboard, and store those
results in Cassandra. (I had to use a sliding window, as my computation
involved joining data that might arrive in different time windows.)

I had a separate set of Spark jobs that pulled the raw data from Cassandra,
computed the aggregations and the more complex metrics, and wrote them back to
the relevant Cassandra tables. These jobs ran periodically, every few
minutes.
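
To make that concrete, here is a minimal sketch of the streaming half of such a
setup, assuming Spark 1.5-era APIs with the Kafka direct stream
(spark-streaming-kafka) and the DataStax spark-cassandra-connector. The topic,
keyspace, table and column names are hypothetical, and the "dashboard metric" is
just a per-page count over a 60-second sliding window:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

object MetricsStreamingJob {

  // Hypothetical raw event, parsed from "pageId,timestampMillis,value" lines.
  case class RawEvent(pageId: String, ts: Long, value: Double)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("metrics-streaming")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local Cassandra
    val ssc = new StreamingContext(conf, Seconds(10))       // 10-second batches

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val events = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("events"))
      .map(_._2)
      .map { line =>
        val Array(pageId, ts, value) = line.split(",")
        RawEvent(pageId, ts.toLong, value.toDouble)
      }

    // (1) Persist the raw events to Cassandra for the later roll-up jobs.
    events.saveToCassandra("metrics", "raw_events",
      SomeColumns("page_id", "ts", "value"))

    // (2) Basic dashboard metric: per-page counts over a 60s window, sliding every 10s.
    events
      .map(e => (e.pageId, 1L))
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
      .map { case (pageId, count) => (pageId, System.currentTimeMillis(), count) }
      .saveToCassandra("metrics", "dashboard_counts",
        SomeColumns("page_id", "computed_at", "count"))

    ssc.start()
    ssc.awaitTermination()
  }
}

The two saveToCassandra calls are independent outputs of the same DStream, so
the dashboard table stays small while raw_events keeps everything the periodic
roll-up jobs need.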

Regards,
Sanket




Re: Calculating Timeseries Aggregation

2015-11-19 Thread Sandip Mehta
Thank you Sanket for the feedback.

Regards
SM


Re: Calculating Timeseries Aggregation

2015-11-18 Thread Sandip Mehta
TD, thank you for your reply.

I agree on the data store requirement. I am using HBase as the underlying store.

So, for every batch interval of, say, 10 seconds:

- Calculate the time dimensions (minute, hour, day, week, month and quarter)
along with the other dimensions and metrics.
- Update the relevant base table at each batch interval with the relevant
metrics for a given set of dimensions.

The only caveat I see is that I'll have to update each of the different
roll-up tables for every batch window.

Is this a valid approach for calculating time series aggregation?

Regards
SM

For minute-level aggregates I have set up a streaming window of, say, 10 seconds,
and I store minute-level aggregates across multiple dimensions in HBase at
every window interval.
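
That per-batch update boils down to a handful of counter increments per
dimension. Below is a minimal sketch of what it could look like, assuming the
events have already been parsed into a DStream and using plain HBase Increments;
the table names (rollup_minute, rollup_hour, rollup_day), the row-key layout
<dimension>#<bucketStartMillis> and the m:count column are all hypothetical:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Increment}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

// Hypothetical event type produced earlier in the streaming job.
case class Event(dimension: String, ts: Long, count: Long)

object RollupWriter {

  // Roll-up tables and their bucket widths in milliseconds (UTC-aligned buckets).
  private val rollups = Seq(
    ("rollup_minute", 60L * 1000),
    ("rollup_hour",   3600L * 1000),
    ("rollup_day",    86400L * 1000))

  def updateRollups(events: DStream[Event]): Unit = {
    events.foreachRDD { rdd =>
      // Pre-aggregate within the batch so each (table, row key) gets a single increment.
      val deltas = rdd.flatMap { e =>
        rollups.map { case (table, width) =>
          ((table, s"${e.dimension}#${e.ts - e.ts % width}"), e.count)
        }
      }.reduceByKey(_ + _)

      deltas.foreachPartition { iter =>
        // One HBase connection per partition; for brevity a Table is opened per increment.
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        try {
          iter.foreach { case ((tableName, rowKey), delta) =>
            val table = conn.getTable(TableName.valueOf(tableName))
            try {
              table.increment(new Increment(Bytes.toBytes(rowKey))
                .addColumn(Bytes.toBytes("m"), Bytes.toBytes("count"), delta))
            } finally {
              table.close()
            }
          }
        } finally {
          conn.close()
        }
      }
    }
  }
}

One caveat of counter increments: they are not idempotent, so a replayed batch
will double-count unless you add your own de-duplication.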




Re: Calculating Timeseries Aggregation

2015-11-18 Thread Sandip Mehta
Thank you TD for your time and help.

SM



Re: Calculating Timeseries Aggregation

2015-11-18 Thread Tathagata Das
There are different ways to do the rollups. You can update the rollups from the
streaming application, generate them in a later process - say, a periodic Spark
job every hour - or generate them on demand when they are queried.
Which one is best depends on your downstream requirements: if you always want
up-to-date rollups to show up in the dashboard (even the day-level ones), the
first approach is better; otherwise, the second and third approaches are more
efficient.
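
For the on-demand flavour, a rollup over minute-level counters is just a
row-key range scan plus a sum. A minimal sketch, assuming minute-level counters
are stored in an HBase table keyed as <dimension>#<minuteStartMillis> with a
long counter in column m:count (all hypothetical names):

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes

object OnDemandRollup {

  // Sum the minute-level counters for one dimension over [fromMillis, toMillis).
  // Note: the row keys sort correctly only while the epoch-millis strings have
  // the same number of digits, which holds for current timestamps.
  def countBetween(dimension: String, fromMillis: Long, toMillis: Long): Long = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    try {
      val table = conn.getTable(TableName.valueOf("rollup_minute")) // hypothetical table
      try {
        val scan = new Scan(
          Bytes.toBytes(s"$dimension#$fromMillis"),
          Bytes.toBytes(s"$dimension#$toMillis"))
        val scanner = table.getScanner(scan)
        try {
          scanner.asScala
            .map(r => Bytes.toLong(r.getValue(Bytes.toBytes("m"), Bytes.toBytes("count"))))
            .sum
        } finally {
          scanner.close()
        }
      } finally {
        table.close()
      }
    } finally {
      conn.close()
    }
  }
}

An hour-level figure is then just countBetween(dim, hourStart, hourStart + 3600 * 1000L).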

TD




Re: Calculating Timeseries Aggregation

2015-11-17 Thread Tathagata Das
For this sort of long-term aggregation you should use a dedicated data storage
system, like a database or a key-value store. Spark Streaming would just
aggregate and push the necessary data to the data store.

TD



Calculating Timeseries Aggregation

2015-11-14 Thread Sandip Mehta
Hi,

I am working on a requirement to calculate real-time metrics and am building a
prototype on Spark Streaming. I need to build aggregates at the second, minute,
hour and day level.

I am not sure whether I should calculate all these aggregates as different
windowed functions on the input DStream, or whether I should use the
updateStateByKey function instead. If I have to use updateStateByKey for these
time series aggregations, how can I remove keys from the state after different
amounts of time have elapsed?

Please suggest.

Regards
SM
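
For reference on the last question: updateStateByKey removes a key from the
state when the update function returns None for it, so keeping a last-updated
timestamp in the state is enough to expire idle keys. A minimal sketch, with a
hypothetical count-plus-timestamp state type (updateStateByKey also requires
ssc.checkpoint to be set):

import org.apache.spark.streaming.dstream.DStream

// Hypothetical running state per key: an aggregate plus the time it was last updated.
case class AggState(count: Long, lastUpdatedMs: Long)

object StatefulAggregation {

  val maxIdleMs: Long = 24L * 3600 * 1000 // assumption: expire keys idle for over a day

  def aggregate(events: DStream[(String, Long)]): DStream[(String, AggState)] = {
    val update = (newValues: Seq[Long], state: Option[AggState]) => {
      val now = System.currentTimeMillis()
      state match {
        case _ if newValues.nonEmpty =>
          val previous = state.map(_.count).getOrElse(0L)
          Some(AggState(previous + newValues.sum, now))
        case Some(s) if now - s.lastUpdatedMs > maxIdleMs =>
          None                       // returning None drops the key from the state
        case other =>
          other                      // keep the existing state (or stay absent)
      }
    }
    events.updateStateByKey(update)
  }
}

Because updateStateByKey visits every key in the state on every batch, idle keys
are eventually dropped even when no new values arrive for them.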
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org