Hi,

I am pretty new to Spark / Spark Streaming, so please excuse my naivety. I have
a timestamped event stream that I would like to aggregate into, let's say,
hourly buckets. The simple answer would be a window operation with a window
length of 1 hour and a sliding interval of 1 hour, but that doesn't quite work:

1. The time boundaries aren't aligned. The process/stream aggregation may start
in the middle of an hour, so the first bucket may cover less than a full hour,
and subsequent buckets should be aligned to wall-clock hour boundaries rather
than to whenever the job happened to start.
2. If I understand this correctly, the above method means all my data is
"collected" for an hour and only then summarised. That is correct as far as it
goes, but how do I get the aggregations to be emitted more frequently than
that? Something like "aggregate these events into hourly buckets, updating them
every 5 seconds" (a rough sketch of what I mean follows below).

I would really appreciate pointers to code samples or some blogs that could 
help me identify best practices.

-- Ankur Chauhan
