Off the top of my head (each approach may have its own issues): if you add a uniqueId to every record upstream, you can use a Bloom filter to check, approximately, whether you have seen a row before. The problem I can see with that approach is how to repopulate the Bloom filter on restarts.
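To make the Bloom filter idea concrete, here is a minimal sketch in plain Python. It is not Spark code, and the names `BloomFilter`, `count_words_once`, and the `(unique_id, line)` record shape are made up for illustration. It assumes every record already carries a unique id assigned upstream.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, a tunable rate of false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive the bit positions from two halves of one SHA-256 digest
        # (double hashing), so only one hash computation is needed per key.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def count_words_once(records, seen, counts):
    """Update word counts, skipping records whose unique id was probably seen."""
    for unique_id, line in records:
        if unique_id in seen:
            # Probable duplicate. A false positive here would drop a new record.
            continue
        seen.add(unique_id)
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts
```

Because the filter can report false positives, a small fraction of genuinely new records may be skipped, so the counts are only approximately exact. The filter's state is just the `bits` byte array, so one possible answer to the restart problem would be to write it to durable storage periodically and reload it on startup. That is an assumption on my part, not something tested here.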
If you are certain that duplicates can only arrive within a bounded window (for example, the same data may be replayed within the last 2 hours but never after that), then you could instead keep the uniqueIds themselves as state and age them out after that window.

Best,
Burak

On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:

> Please share your thoughts.....
>
> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>
>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>
>>> My streaming application stores lot of aggregations using mapWithState.
>>>
>>> I want to know what are all the possible ways I can make it idempotent.
>>>
>>> Please share your views.
>>>
>>> Thanks
>>>
>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>
>>>> In a Wordcount application which stores the count of all the words
>>>> input so far using mapWithState. How do I make sure my counts are not
>>>> messed up if I happen to read a line more than once?
>>>>
>>>> Appreciate your response.
>>>>
>>>> Thanks
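The age-out idea from the reply at the top of this message can be sketched as follows. Again this is plain Python rather than Spark code, and `ExpiringIdSet`, `seen_before`, and `expire` are hypothetical names chosen for illustration. Timestamps are passed in explicitly (in seconds) so the behaviour is easy to follow.

```python
class ExpiringIdSet:
    """Remembers unique ids for a limited time, then forgets them."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.first_seen = {}  # unique id -> time it was first recorded

    def seen_before(self, unique_id, now):
        """Return True if unique_id is already held; otherwise record it."""
        if unique_id in self.first_seen:
            return True
        self.first_seen[unique_id] = now
        return False

    def expire(self, now):
        """Drop ids first seen more than ttl_seconds ago; return how many."""
        cutoff = now - self.ttl_seconds
        expired = [uid for uid, ts in self.first_seen.items() if ts < cutoff]
        for uid in expired:
            del self.first_seen[uid]
        return len(expired)


# Usage: a 2-hour window, matching the example in the reply.
ids = ExpiringIdSet(ttl_seconds=2 * 60 * 60)
if not ids.seen_before("record-42", now=0):
    pass  # first sighting: safe to update the aggregation here
ids.expire(now=3 * 60 * 60)  # "record-42" is now older than 2 hours and is dropped
```

Unlike the Bloom filter, this is exact within the window, at the cost of storing every id until it expires. In a mapWithState job, the idle timeout on `StateSpec` may serve a similar purpose, though whether it fits depends on how the state is keyed.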