I noticed that the restart problem with 1 wouldn't actually be an issue,
because you'll save the BloomFilter in the state.
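
I still don't have real sample code, but here is a rough sketch of the idea in plain Python rather than the Spark API. Everything in it is an assumption for illustration: the tiny BloomFilter class, its sizes, and the `update` function name are made up, and in a real job this logic would live in the function you pass to mapWithState. The per-word state is the pair (count, filter), so the filter travels with the state.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter; the sizes are arbitrary assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def update(state, record_id):
    """Hypothetical per-word update: state is (count, filter) or None."""
    count, seen = state if state is not None else (0, BloomFilter())
    if not seen.might_contain(record_id):
        # First time we see this record id: count it and remember it.
        seen.add(record_id)
        count += 1
    return count, seen
```

Keep in mind that a Bloom filter can report false positives, so the count is approximate: now and then a genuinely new record may be treated as a duplicate and skipped.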

For 2, you would keep a Map from each UUID to the timestamp of when you saw
it. If the UUID already exists in the map, you wouldn't increase the count.
Once the timestamp of a UUID expires, you would remove it from the map, so
that the map takes a bounded amount of space. It'll probably still take a lot
more space than the BloomFilter, though, depending on your data volume.
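
Again as a rough sketch in plain Python, not Spark code: the function name `update_with_ttl` and its parameters are hypothetical, and the caller is assumed to supply the current time and the expiry window in seconds. The per-word state is the pair (count, map of UUID to timestamp).

```python
def update_with_ttl(state, record_id, now, ttl_seconds):
    """Hypothetical per-word update: state is (count, seen) or None,
    where seen maps a record id to the time it was seen."""
    count, seen = state if state is not None else (0, {})
    # Age out ids whose timestamp has expired, so the map stays bounded.
    seen = {rid: ts for rid, ts in seen.items() if now - ts < ttl_seconds}
    if record_id not in seen:
        # Not seen within the window: count it and remember when we saw it.
        seen[record_id] = now
        count += 1
    return count, seen
```

With a 2 hour window you would pass ttl_seconds=7200. Note that once a UUID has been aged out, a late duplicate of it would be counted again, which is why this only works if you are certain duplicates cannot arrive after the window.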

On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <deshpandesh...@gmail.com>
wrote:

> In the previous email you gave me 2 solutions
> 1. Bloom filter --> problem in repopulating the bloom filter on restarts
> 2. keeping the state of the unique ids
>
> Please elaborate on 2.
>
>
>
> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> I don't have any sample code, but on a high level:
>>
>> My state would be: (Long, BloomFilter[UUID])
>> In the update function, my value will be the UUID of the record, since
>> the word itself is the key.
>> I'll ask my BloomFilter if I've seen this UUID before. If not, I increase
>> the count and also add the UUID to the filter.
>>
>> Does that make sense?
>>
>>
>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Hi Burak,
>>> Thanks for the response. Can you please elaborate on your idea of
>>> storing the state of the unique ids.
>>> Do you have any sample code or links I can refer to?
>>> Thanks
>>>
>>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>
>>>> Off the top of my head... (Each may have its own issues)
>>>>
>>>> If upstream you add a uniqueId to all your records, then you may use a
>>>> BloomFilter to approximate if you've seen a row before.
>>>> The problem I can see with that approach is how to repopulate the bloom
>>>> filter on restarts.
>>>>
>>>> If you are certain that you're not going to reprocess some data after a
>>>> certain time, i.e. the same data can only show up again within the last
>>>> 2 hours and never after that, then you may also keep the state of the
>>>> uniqueIds, and then age them out after a certain time.
>>>>
>>>>
>>>> Best,
>>>> Burak
>>>>
>>>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>>>> deshpandesh...@gmail.com> wrote:
>>>>
>>>>> Please share your thoughts.....
>>>>>
>>>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>>>>> deshpandesh...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>
>>>>>>> My streaming application stores a lot of aggregations using
>>>>>>> mapWithState.
>>>>>>>
>>>>>>> I want to know all the possible ways I can make it
>>>>>>> idempotent.
>>>>>>>
>>>>>>> Please share your views.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> In a word count application that stores the count of all the words
>>>>>>>> input so far using mapWithState, how do I make sure my counts are not
>>>>>>>> messed up if I happen to read a line more than once?
>>>>>>>>
>>>>>>>> Appreciate your response.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
