I realized that 1 (repopulating the Bloom filter on restarts) wouldn't be a problem, because you'll save the BloomFilter in the state.
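A minimal sketch of the Bloom-filter-in-state idea, in plain Scala so it runs without Spark. The names `BloomDedup`, `SimpleBloomFilter`, `newState` and `update` are made up for illustration, and the toy filter only stands in for a real library Bloom filter. In an actual job, the `update` logic would sit inside the function you pass to mapWithState.

```scala
import java.util.{BitSet, UUID}

object BloomDedup {
  // Toy Bloom filter (hypothetical, for illustration only).
  // Serializable so it could be checkpointed together with the state.
  final class SimpleBloomFilter(numBits: Int, numHashes: Int) extends Serializable {
    private val bits = new BitSet(numBits)

    // Derive numHashes bit positions from the two 64-bit halves of the UUID
    // using double hashing: h1 + i * h2 (mod numBits).
    private def positions(id: UUID): Seq[Int] = {
      val h1 = id.getMostSignificantBits
      val h2 = id.getLeastSignificantBits
      (0 until numHashes).map { i =>
        java.lang.Math.floorMod(h1 + i * h2, numBits.toLong).toInt
      }
    }

    def mightContain(id: UUID): Boolean = positions(id).forall(p => bits.get(p))
    def put(id: UUID): Unit = positions(id).foreach(p => bits.set(p))
  }

  // State per word: (count, filter of record UUIDs already counted).
  type WordState = (Long, SimpleBloomFilter)

  // Filter size and hash count are arbitrary assumptions; tune for your volume.
  def newState(): WordState = (0L, new SimpleBloomFilter(1 << 16, 3))

  // One update step for a single record of a given word.
  def update(state: WordState, recordId: UUID): WordState = {
    val (count, seen) = state
    if (seen.mightContain(recordId)) {
      state // probably a replayed record: leave the count alone
    } else {
      seen.put(recordId) // remember this record, then count it
      (count + 1, seen)
    }
  }
}
```

Keep in mind that a Bloom filter can report false positives, so a genuinely new record is occasionally skipped and the count can end up slightly low. It never counts a record it has already seen.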
For 2, you would keep a Map of UUIDs to the timestamp of when you saw them. If the UUID exists in the map, then you wouldn't increase the count. If the timestamp of a UUID expires, you would remove it from the map. The reason we remove from the map is to keep a bounded amount of space. It'll probably take a lot more space than the BloomFilter though, depending on your data volume.

On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:

> In the previous email you gave me 2 solutions
> 1. Bloom filter --> problem in repopulating the bloom filter on restarts
> 2. keeping the state of the unique ids
>
> Please elaborate on 2.
>
> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> I don't have any sample code, but on a high level:
>>
>> My state would be: (Long, BloomFilter[UUID])
>> In the update function, my value will be the UUID of the record, since
>> the word itself is the key.
>> I'll ask my BloomFilter if I've seen this UUID before. If not, increase
>> the count and also add it to the filter.
>>
>> Does that make sense?
>>
>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>
>>> Hi Burak,
>>> Thanks for the response. Can you please elaborate on your idea of
>>> storing the state of the unique ids?
>>> Do you have any sample code or links I can refer to?
>>> Thanks
>>>
>>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>
>>>> Off the top of my head... (Each may have its own issues)
>>>>
>>>> If upstream you add a uniqueId to all your records, then you may use a
>>>> BloomFilter to approximate if you've seen a row before.
>>>> The problem I can see with that approach is how to repopulate the bloom
>>>> filter on restarts.
>>>>
>>>> If you are certain that you're not going to reprocess some data after a
>>>> certain time, i.e. there is no way I'm going to get the same data in 2
>>>> hours, it may only happen in the last 2 hours, then you may also keep the
>>>> state of uniqueIds as well, and then age them out after a certain time.
>>>>
>>>> Best,
>>>> Burak
>>>>
>>>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>
>>>>> Please share your thoughts.....
>>>>>
>>>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>
>>>>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>>
>>>>>>> My streaming application stores a lot of aggregations using
>>>>>>> mapWithState.
>>>>>>>
>>>>>>> I want to know what are all the possible ways I can make it
>>>>>>> idempotent.
>>>>>>>
>>>>>>> Please share your views.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> In a Wordcount application which stores the count of all the words
>>>>>>>> input so far using mapWithState, how do I make sure my counts are not
>>>>>>>> messed up if I happen to read a line more than once?
>>>>>>>>
>>>>>>>> Appreciate your response.
>>>>>>>>
>>>>>>>> Thanks
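
P.S. For approach 2 (a map of UUIDs to the timestamp when each was seen, aged out after a fixed window), a rough sketch in plain Scala could look like the following. `TtlDedup`, `WordState`, `update`, `nowMs` and `ttlMs` are hypothetical names chosen for illustration, and the 2-hour window is only the example figure from this thread. In a real job this logic would go in the function passed to mapWithState.

```scala
import java.util.UUID

object TtlDedup {
  // State per word: the count plus, for each record id, the time (ms) it was first seen.
  final case class WordState(count: Long, seen: Map[UUID, Long])

  val empty: WordState = WordState(0L, Map.empty)

  // ttlMs is the window during which a replay may still arrive (assumed, e.g. 2 hours).
  def update(state: WordState, id: UUID, nowMs: Long, ttlMs: Long): WordState = {
    // Age out ids older than the window so the map stays bounded.
    // Scanning the whole map on every update is simple but not efficient.
    val live = state.seen.filter { case (_, firstSeen) => nowMs - firstSeen < ttlMs }

    if (live.contains(id)) state.copy(seen = live)        // duplicate: count unchanged
    else WordState(state.count + 1, live + (id -> nowMs)) // first sighting: count it
  }
}
```

Unlike the Bloom filter, this is exact within the window, but an id that shows up again after it has been aged out is counted a second time. That is why it only works if you are certain replays cannot arrive later than the window.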