It sounds like another window operation on top of the 30-minute window will
achieve the desired objective.
Just keep in mind that you'll need to set the cleaner TTL (spark.cleaner.ttl)
to a long enough value, and you will need enough resources (memory & disk)
to keep the required data around.
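
Roughly, a window-on-window pipeline could look like the sketch below. It is
only a sketch: the socket source, the key/count aggregation, the checkpoint
path and the TTL value are placeholders standing in for your filtered Kafka
DStream and your actual sizing.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowOnWindowSketch {
  def main(args: Array[String]): Unit = {
    // Keep metadata around longer than the largest window (here ~25h, in
    // seconds), and make sure the cluster has the memory/disk to hold it.
    val conf = new SparkConf()
      .setAppName("window-on-window-sketch")
      .set("spark.cleaner.ttl", (25 * 3600).toString)

    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/window-sketch-checkpoint")

    // Placeholder source: in the real job this would be the filtered Kafka
    // DStream of (key, count) pairs.
    val filtered = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

    // First window: aggregate by key over 30 minutes, emitted every 30 minutes.
    val halfHourly = filtered.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, Minutes(30), Minutes(30))

    // Second window on top of the first: roll the 30-minute aggregates up to
    // the last 24 hours, recomputed every 30 minutes (48 half-hour slices).
    val daily = halfHourly.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, Minutes(24 * 60), Minutes(30))

    daily.foreachRDD { rdd =>
      // Run the periodic computation over the last day of aggregates here.
      rdd.take(10).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The second reduceByKeyAndWindow would replace the manual deque/subtraction
bookkeeping, since Spark recomputes the 24-hour roll-up from the retained
30-minute slices.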

-kr, Gerard.

On Sun, Feb 21, 2016 at 12:54 PM, Jatin Kumar <
jku...@rocketfuelinc.com.invalid> wrote:

> Hello Spark users,
>
> I have to aggregate messages from Kafka and at some fixed interval (say
> every half hour) update a memory-persisted RDD and run some computation.
> This computation uses the last one day of data. The steps are:
>
> - Read from a real-time Kafka topic X in Spark Streaming batches of 5 seconds
> - Filter the above DStream messages and keep some of them
> - Create windows of 30 minutes on the above DStream and aggregate by key
> - Merge this 30-minute RDD with a memory-persisted RDD, say combinedRdd
> - Maintain the last N such RDDs in a deque, persisting them on disk. While
> adding a new RDD, subtract the oldest RDD from combinedRdd.
> - As a final step, consider the last N such windows (of 30 minutes each) and
> do the final aggregation
>
> Does the above way of using Spark Streaming look reasonable? Is there a
> better way of doing this?
>
> --
> Thanks
> Jatin
>
>
