Thanks Burak. I do want accuracy; that is why I want to make it idempotent. I will try out your 2nd solution.

On Wed, Jan 25, 2017 at 12:27 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Yes you may. Depends on if you want exact values or if you're okay with
> approximations. With Big Data, generally you would be okay with
> approximations. Try both out, see what scales/works with your dataset.
> Maybe you may handle the second implementation.
>
> On Wed, Jan 25, 2017 at 12:23 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>
>> Thanks Burak. But with BloomFilter, won't I be getting a false positive?
>>
>> On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>
>>> I noticed that 1 wouldn't be a problem, because you'll save the
>>> BloomFilter in the state.
>>>
>>> For 2, you would keep a Map of UUIDs to the timestamp of when you saw
>>> them. If the UUID exists in the map, then you wouldn't increase the count.
>>> If the timestamp of a UUID expires, you would remove it from the map. The
>>> reason we remove from the map is to keep a bounded amount of space. It'll
>>> probably take a lot more space than the BloomFilter though, depending on
>>> your data volume.
>>>
>>> On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>
>>>> In the previous email you gave me 2 solutions:
>>>> 1. Bloom filter --> problem in repopulating the bloom filter on
>>>>    restarts
>>>> 2. keeping the state of the unique ids
>>>>
>>>> Please elaborate on 2.
>>>>
>>>> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>
>>>>> I don't have any sample code, but on a high level:
>>>>>
>>>>> My state would be: (Long, BloomFilter[UUID])
>>>>> In the update function, my value will be the UUID of the record, since
>>>>> the word itself is the key.
>>>>> I'll ask my BloomFilter if I've seen this UUID before. If not, increase
>>>>> count, also add to Filter.
>>>>>
>>>>> Does that make sense?
>>>>>
>>>>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>
>>>>>> Hi Burak,
>>>>>> Thanks for the response. Can you please elaborate on your idea of
>>>>>> storing the state of the unique ids?
>>>>>> Do you have any sample code or links I can refer to?
>>>>>> Thanks
>>>>>>
>>>>>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>>>
>>>>>>> Off the top of my head... (Each may have its own issues)
>>>>>>>
>>>>>>> If upstream you add a uniqueId to all your records, then you may use
>>>>>>> a BloomFilter to approximate if you've seen a row before.
>>>>>>> The problem I can see with that approach is how to repopulate the
>>>>>>> bloom filter on restarts.
>>>>>>>
>>>>>>> If you are certain that you're not going to reprocess some data
>>>>>>> after a certain time, i.e. there is no way I'm going to get the same
>>>>>>> data in 2 hours, it may only happen in the last 2 hours, then you may
>>>>>>> also keep the state of uniqueIds as well, and then age them out after
>>>>>>> a certain time.
>>>>>>>
>>>>>>> Best,
>>>>>>> Burak
>>>>>>>
>>>>>>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Please share your thoughts.....
>>>>>>>>
>>>>>>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> My streaming application stores a lot of aggregations using
>>>>>>>>>> mapWithState.
>>>>>>>>>>
>>>>>>>>>> I want to know what are all the possible ways I can make it
>>>>>>>>>> idempotent.
>>>>>>>>>>
>>>>>>>>>> Please share your views.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> In a Wordcount application which stores the count of all the
>>>>>>>>>>> words input so far using mapWithState, how do I make sure my
>>>>>>>>>>> counts are not messed up if I happen to read a line more than once?
>>>>>>>>>>>
>>>>>>>>>>> Appreciate your response.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
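
For reference, the second approach discussed in the thread (keep a map of record ids to the time they were seen, skip ids already in the map, and age entries out after a fixed window) could be sketched roughly as below. This is a plain-Python model of the per-word state update only, not Spark code. The names WordState, update, and ttl_seconds are hypothetical, and wiring this into a mapWithState update function is left out.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class WordState:
    """Per-word state: the running count plus the record ids seen recently."""
    count: int = 0
    # record id -> time the id was first seen (hypothetical layout)
    seen: Dict[str, float] = field(default_factory=dict)


def update(state: WordState, record_id: str, now: float, ttl_seconds: float) -> WordState:
    """Count a record once, ignoring replays of an id still held in state."""
    # Age out ids older than the TTL so the map stays bounded in size.
    cutoff = now - ttl_seconds
    state.seen = {rid: ts for rid, ts in state.seen.items() if ts > cutoff}

    # Only count ids that are not currently in the map.
    if record_id not in state.seen:
        state.seen[record_id] = now
        state.count += 1
    return state


if __name__ == "__main__":
    two_hours = 2 * 60 * 60
    state = WordState()
    state = update(state, "id-1", now=0.0, ttl_seconds=two_hours)
    state = update(state, "id-2", now=10.0, ttl_seconds=two_hours)
    state = update(state, "id-1", now=20.0, ttl_seconds=two_hours)  # replay, ignored
    print(state.count)  # 2
```

Note that this is only exact under the assumption stated in the thread: a record is never replayed after its id has aged out. A replay that arrives later than ttl_seconds would be counted again.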