An RDD is immutable; it cannot be changed. You can only create a new one from data or from a transformation. It sounds inefficient to create one every 15 seconds covering the last 24 hours of events. I think a key-value store would be a much better fit for this purpose.
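To make the key-value-store idea concrete, here is a minimal sketch of dedup with a 24-hour TTL, in plain Python. This is only an in-memory stand-in (the class and names are illustrative, not from any library); a real deployment would use an external store such as Redis with per-key expiry:

```python
import time

class TtlDedupStore:
    """Minimal in-memory stand-in for a key-value store with per-key TTL.
    A production setup would use an external store (e.g. Redis with EXPIRE)
    so state survives across streaming batches."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}  # event_id -> timestamp when first seen

    def is_duplicate(self, event_id, now=None):
        now = time.time() if now is None else now
        # Evict entries older than the TTL window (24h in this thread's case).
        for k in [k for k, ts in self.seen.items() if now - ts > self.ttl]:
            del self.seen[k]
        if event_id in self.seen:
            return True
        self.seen[event_id] = now
        return False

store = TtlDedupStore(ttl_seconds=24 * 3600)
print(store.is_duplicate("evt-1", now=0))          # False: first time seen
print(store.is_duplicate("evt-1", now=100))        # True: seen within 24h
print(store.is_duplicate("evt-1", now=25 * 3600))  # False: entry expired
```

Each micro-batch would then call `is_duplicate` per incoming event id and skip the duplicates, with expiry taking care of the "remove events older than 24 hours" requirement automatically.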
On Mon, Jul 27, 2015 at 11:21 AM Shushant Arora <shushantaror...@gmail.com> wrote:

> It's for one day of events, on the order of 1 billion, and processing is in a
> streaming application with a ~10-15 sec interval, so lookup should be fast. The
> RDD needs to be updated with new events, and old events from before current
> time minus 24 hours should be removed at each processing step.
>
> So is a Spark RDD not a fit for this requirement?
>
> On Mon, Jul 27, 2015 at 1:08 PM, Romi Kuntsman <r...@totango.com> wrote:
>
>> What is the throughput of processing, and for how long do you need to
>> remember duplicates?
>>
>> You can take all the events, put them in an RDD, group by the key, and
>> then process each key only once.
>> But if you have a long-running application where you want to check, for
>> every value, that you haven't seen the same value before, you probably
>> need a key-value store, not an RDD.
>>
>> On Sun, Jul 26, 2015 at 7:38 PM Shushant Arora <shushantaror...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> I have a requirement for processing a large number of events while
>>> ignoring duplicates at the same time.
>>>
>>> Events are consumed from Kafka, and each event has an eventid. It may
>>> happen that an event has already been processed and comes again at some
>>> other offset.
>>>
>>> 1. Can I use a Spark RDD to persist processed events and then look up
>>> new events against it while processing? (How do I do a lookup inside an
>>> RDD? I have a JavaPairRDD<eventid,timestamp>.) If an event is present in
>>> the persisted RDD, ignore it; otherwise, process the event. Will
>>> rdd.lookup(key) on billions of events be efficient?
>>>
>>> 2. How do I update the RDD (since an RDD is immutable)?
>>>
>>> Thanks
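For the batch case, the "group by the key, and then process each key only once" suggestion above can be sketched in plain Python (a Spark-free stand-in; in Spark this would be a reduceByKey over the (eventid, timestamp) pair RDD):

```python
# Batch-style dedup: keep the earliest occurrence per event id,
# mirroring "group by the key, process each key only once".
# Sample data is made up for illustration: (event_id, timestamp) pairs.
events = [("e1", 10), ("e2", 11), ("e1", 50), ("e3", 12)]

deduped = {}
for event_id, ts in events:
    # Equivalent of reduceByKey(min): retain the smallest timestamp per key.
    if event_id not in deduped or ts < deduped[event_id]:
        deduped[event_id] = ts

print(sorted(deduped.items()))  # [('e1', 10), ('e2', 11), ('e3', 12)]
```

This works well when all events of interest fit in one job's input; the long-running, continuously updated 24-hour window is the case where it breaks down and an external key-value store fits better.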