It sounds like another window operation on top of the 30-minute window will achieve the desired objective; see the sketch below. Just keep in mind that you'll need to set the cleaner TTL (spark.cleaner.ttl) to a long enough value, and you will need enough resources (memory & disk) to keep the required data.
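Something like this minimal sketch is the shape I have in mind (Scala, Spark 1.6-era APIs; keepMessage, extractKey, the broker address, the checkpoint path and the count-per-key aggregation are all placeholders for your own logic):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object WindowOnWindow {
  // Placeholders: adapt to your message format and aggregation.
  def keepMessage(m: String): Boolean = m.nonEmpty
  def extractKey(m: String): String = m.split(',')(0)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KafkaWindowedAggregation")
      // Cleaner TTL (seconds) must comfortably exceed the largest window (1 day).
      .set("spark.cleaner.ttl", (26 * 3600).toString)
    val ssc = new StreamingContext(conf, Seconds(5))
    // Checkpointing is prudent with long windows and required if you use
    // the inverse-reduce variant mentioned below.
    ssc.checkpoint("/tmp/agg-checkpoint")

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("X"))

    val pairs = stream.map(_._2)            // drop Kafka keys, keep payloads
      .filter(keepMessage)                  // your filtering step
      .map(m => (extractKey(m), 1L))        // count per key as a stand-in aggregation

    // First window: non-overlapping 30-minute aggregates (slide == window).
    val halfHourly =
      pairs.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(30), Minutes(30))

    // Second window on top: one day of data, re-evaluated every 30 minutes.
    // This replaces the hand-rolled deque of N RDDs and the subtract step.
    val daily =
      halfHourly.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(24 * 60), Minutes(30))

    daily.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

If recomputing the full day every 30 minutes gets expensive, the reduceByKeyAndWindow overload that also takes an inverse reduce function lets Spark subtract the expiring 30-minute slice instead of re-reducing the whole window.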
-kr, Gerard.

On Sun, Feb 21, 2016 at 12:54 PM, Jatin Kumar <jku...@rocketfuelinc.com.invalid> wrote:

> Hello Spark users,
>
> I have to aggregate messages from Kafka and, at some fixed interval (say
> every half hour), update a memory-persisted RDD and run some computation.
> This computation uses the last one day of data. The steps are:
>
> - Read from realtime Kafka topic X in Spark Streaming batches of 5 seconds
> - Filter the above DStream's messages and keep some of them
> - Create windows of 30 minutes on the above DStream and aggregate by key
> - Merge this 30-minute RDD with a memory-persisted RDD, say combinedRdd
> - Maintain the last N such RDDs in a deque, persisting them on disk. While
>   adding a new RDD, subtract the oldest RDD from the combinedRdd.
> - As a final step, consider the last N such windows (of 30 minutes each)
>   and do the final aggregation
>
> Does the above way of using Spark Streaming look reasonable? Is there a
> better way of doing the above?
>
> --
> Thanks
> Jatin