Thanks Burak. I do want accuracy; that is why I want to make it idempotent.
I will try out your 2nd solution.

On Wed, Jan 25, 2017 at 12:27 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Yes, you may. It depends on whether you want exact values or are okay with
> approximations. With Big Data, you would generally be okay with
> approximations. Try both out and see what scales and works with your dataset.
> You may well be able to handle the second implementation.
>
> On Wed, Jan 25, 2017 at 12:23 PM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> Thanks Burak. But with a BloomFilter, won't I be getting false positives?
>>
>> On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>
>>> I noticed that 1 wouldn't be a problem, because you'll save the
>>> BloomFilter in the state.
>>>
>>> For 2, you would keep a Map of UUIDs to the timestamp of when you saw
>>> them. If the UUID exists in the map, then you wouldn't increase the count.
>>> If the timestamp of a UUID expires, you would remove it from the map. The
>>> reason we remove from the map is to keep a bounded amount of space. It'll
>>> probably take a lot more space than the BloomFilter, though, depending on
>>> your data volume.
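
A minimal sketch of the map-with-expiry idea described above, in plain Python rather than Spark's mapWithState API. The function name update_with_expiry, the (count, seen) state layout, and the 2-hour ttl_seconds default are assumptions made for illustration. The timestamp stored is the time a UUID was first seen, which is one possible reading of "the timestamp of when you saw them".

```python
def update_with_expiry(record_id, event_time, state, ttl_seconds=7200):
    """Update the state for one word.

    state is (count, {record_id: first_seen_time}), or None for a new word.
    ttl_seconds is a hypothetical expiry window (2 hours here).
    """
    count, seen = state if state is not None else (0, {})
    # Age out ids whose timestamp has expired so the map stays bounded.
    seen = {rid: ts for rid, ts in seen.items() if event_time - ts < ttl_seconds}
    if record_id not in seen:
        # First sighting within the window: count it and remember when.
        seen[record_id] = event_time
        count += 1
    return count, seen


# Usage: "a" is replayed at t=20 (ignored) and again at t=8000 (after expiry).
state = None
for rid, t in [("a", 0), ("b", 10), ("a", 20), ("a", 8000)]:
    state = update_with_expiry(rid, t, state)
print(state[0])  # 3: the replay at t=8000 arrives after expiry and is counted again
```

The last event shows the trade-off: counts stay exact only as long as duplicates never arrive after the expiry window.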
>>>
>>> On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
>>>> In the previous email you gave me 2 solutions:
>>>> 1. Bloom filter --> problem in repopulating the bloom filter on
>>>> restarts
>>>> 2. Keeping the state of the unique ids
>>>>
>>>> Please elaborate on 2.
>>>>
>>>>
>>>>
>>>> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>
>>>>> I don't have any sample code, but on a high level:
>>>>>
>>>>> My state would be: (Long, BloomFilter[UUID])
>>>>> In the update function, my value will be the UUID of the record, since
>>>>> the word itself is the key.
>>>>> I'll ask my BloomFilter if I've seen this UUID before. If not, I
>>>>> increase the count and add the UUID to the filter.
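
A rough sketch of the update function described above, in plain Python and independent of Spark. The tiny BloomFilter class, its sizing parameters, and the name update_with_bloom are all hypothetical stand-ins. A real job would more likely use an existing Bloom filter implementation inside the mapWithState state.

```python
import hashlib
import uuid


class BloomFilter:
    """Tiny illustrative Bloom filter; sizes are arbitrary, not tuned."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def update_with_bloom(record_id, state):
    """Update the state for one word (the key); the value is the record's UUID.

    state is (count, bloom_filter), or None for a new word.
    """
    count, seen = state if state is not None else (0, BloomFilter())
    if not seen.might_contain(record_id):
        seen.add(record_id)
        count += 1
    return count, seen


# Usage: the same record id is delivered twice but only counted once.
first, second = uuid.UUID(int=1), uuid.UUID(int=2)
state = None
for rid in [first, second, first]:
    state = update_with_bloom(rid, state)
print(state[0])
```

Because a Bloom filter can report a false positive, a genuinely new record is occasionally treated as a duplicate, so this approach can undercount slightly but never double-counts a replayed record.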
>>>>>
>>>>> Does that make sense?
>>>>>
>>>>>
>>>>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <
>>>>> deshpandesh...@gmail.com> wrote:
>>>>>
>>>>>> Hi Burak,
>>>>>> Thanks for the response. Can you please elaborate on your idea of
>>>>>> storing the state of the unique ids.
>>>>>> Do you have any sample code or links I can refer to?
>>>>>> Thanks
>>>>>>
>>>>>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <brk...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Off the top of my head... (Each may have its own issues)
>>>>>>>
>>>>>>> If upstream you add a uniqueId to all your records, then you may use
>>>>>>> a BloomFilter to approximate if you've seen a row before.
>>>>>>> The problem I can see with that approach is how to repopulate the
>>>>>>> bloom filter on restarts.
>>>>>>>
>>>>>>> If you are certain that you're not going to reprocess some data
>>>>>>> after a certain time (i.e. a duplicate can only arrive within, say,
>>>>>>> the last 2 hours and never later), then you may also keep the state
>>>>>>> of the uniqueIds and age them out after that time.
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Burak
>>>>>>>
>>>>>>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Please share your thoughts.....
>>>>>>>>
>>>>>>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>>>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> My streaming application stores a lot of aggregations using
>>>>>>>>>> mapWithState.
>>>>>>>>>>
>>>>>>>>>> I want to know all the possible ways I can make it
>>>>>>>>>> idempotent.
>>>>>>>>>>
>>>>>>>>>> Please share your views.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>>>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> In a word count application that stores the count of all the
>>>>>>>>>>> words input so far using mapWithState, how do I make sure my
>>>>>>>>>>> counts are not messed up if I happen to read a line more than
>>>>>>>>>>> once?
>>>>>>>>>>>
>>>>>>>>>>> Appreciate your response.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>