Hi, I am pretty new to Spark / Spark Streaming, so please excuse my naivety. I have a stream of timestamped events that I would like to aggregate into, say, hourly buckets. The simple answer seems to be a window operation with a window length of 1 hour and a sliding interval of 1 hour, roughly like the sketch below. But that doesn't quite work for two reasons:
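To make this concrete, here is roughly what I mean by the simple answer. It is only a sketch, not code I have running: I am assuming a 5-second batch interval, a socket source with one event per line, and that all I need is a count per window.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    object HourlyWindowSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("hourly-window-sketch")
        // 5-second batches; the window and slide must be multiples of this
        val ssc = new StreamingContext(conf, Seconds(5))

        // assuming one event arrives per line on a socket, just for illustration
        val events = ssc.socketTextStream("localhost", 9999)

        // 1-hour window sliding by 1 hour, so each event is counted exactly once,
        // but the windows are aligned to when the job started, not to clock hours
        val hourlyCounts = events
          .map(_ => 1L)
          .reduceByWindow(_ + _, Minutes(60), Minutes(60))

        hourlyCounts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }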
1. The time boundaries are not aligned to the clock. The job may start in the middle of an hour, so the first window covers less than a full hour, and every subsequent window should really be aligned to the next clock hour.
2. If I understand this correctly, the above approach collects data for a full hour and only then summarises it. The result is correct, but how do I get the aggregations to be updated more frequently than that? Something like "aggregate these events into hourly buckets, updating them every 5 seconds", along the lines of the second sketch below.

I would really appreciate pointers to code samples or blogs that could help me identify best practices here.
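To show what I am imagining (again only a guess, not something I have verified; the socket source, line format, and checkpoint path are made up for illustration): key each event by its timestamp floored to the hour, so the buckets follow clock hours regardless of when the job started, and keep a running count per bucket that is refreshed on every 5-second batch, e.g. with updateStateByKey.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair DStream operations on older Spark versions

    object HourlyBucketsSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("hourly-buckets-sketch")
        val ssc = new StreamingContext(conf, Seconds(5))    // results refreshed every 5 seconds
        ssc.checkpoint("/tmp/hourly-buckets-checkpoint")    // required by updateStateByKey

        // assuming one "timestampMillis,payload" line per event; the parsing is a placeholder
        val eventTimes = ssc.socketTextStream("localhost", 9999)
          .map(_.split(",", 2)(0).trim.toLong)

        // bucket each event by the clock hour it falls into,
        // independent of when the streaming job was started
        val hourKeyed = eventTimes.map(ts => (ts - ts % (3600L * 1000L), 1L))

        // running count per hour bucket, updated on every 5-second batch;
        // finished buckets could eventually be expired by returning None
        val runningCounts = hourKeyed.updateStateByKey[Long] { (batchCounts: Seq[Long], state: Option[Long]) =>
          Some(state.getOrElse(0L) + batchCounts.sum)
        }

        runningCounts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

Is this a reasonable pattern, or is there a better-established way to do clock-aligned buckets with frequent updates?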
-- Ankur Chauhan