Yes, you may get false positives. It depends on whether you want exact
values or you're okay with approximations. With big data, you would
generally be okay with approximations. Try both out and see which one
scales and works with your dataset. You may well be able to handle the
second implementation.
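
For what it's worth, here is a minimal sketch of the two update functions
discussed in this thread, written in plain Python rather than against the
actual mapWithState API. The names BloomFilter, update_with_bloom,
update_with_ttl_map and the ttl_seconds parameter are made up for
illustration, and the toy Bloom filter is not Spark's implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter for illustration only (hypothetical helper)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def might_contain(self, item):
        # True can be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)


def update_with_bloom(state, uuid):
    """Approach 1: state is (count, BloomFilter). Approximate."""
    count, seen = state
    if not seen.might_contain(uuid):
        seen.add(uuid)
        count += 1
    return count, seen


def update_with_ttl_map(state, uuid, now, ttl_seconds):
    """Approach 2: state is (count, {uuid: first_seen_timestamp}). Exact
    as long as duplicates arrive within ttl_seconds of the first copy."""
    count, seen = state
    # Age out expired UUIDs so the map stays bounded.
    for key in [k for k, ts in seen.items() if now - ts > ttl_seconds]:
        del seen[key]
    if uuid not in seen:
        seen[uuid] = now
        count += 1
    return count, seen


# Replaying the same record id does not bump the count in either approach.
bloom_state = (0, BloomFilter())
for record_id in ["a", "b", "a"]:
    bloom_state = update_with_bloom(bloom_state, record_id)

map_state = (0, {})
for record_id, ts in [("a", 0), ("a", 10), ("b", 20)]:
    map_state = update_with_ttl_map(map_state, record_id, ts, ttl_seconds=100)
```

In both cases the state for a word is a pair of the running count and the
"seen" structure. The Bloom filter stays small but can undercount when it
reports a false positive; the map is exact within the expiry window but
grows with the number of distinct UUIDs seen in that window.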

On Wed, Jan 25, 2017 at 12:23 PM, shyla deshpande <deshpandesh...@gmail.com>
wrote:

> Thanks Burak. But with BloomFilter, won't I be getting false positives?
>
> On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> I noticed that 1 wouldn't be a problem, because you'll save the
>> BloomFilter in the state.
>>
>> For 2, you would keep a Map of UUIDs to the timestamp of when you saw
>> them. If the UUID exists in the map, then you wouldn't increase the count.
>> If the timestamp of a UUID expires, you would remove it from the map. The
>> reason we remove from the map is to keep a bounded amount of space. It'll
>> probably take a lot more space than the BloomFilter, though, depending on
>> your data volume.
>>
>> On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> In the previous email you gave me 2 solutions:
>>> 1. Bloom filter --> problem in repopulating the bloom filter on restarts
>>> 2. keeping the state of the unique ids
>>>
>>> Please elaborate on 2.
>>>
>>>
>>>
>>> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>
>>>> I don't have any sample code, but on a high level:
>>>>
>>>> My state would be: (Long, BloomFilter[UUID])
>>>> In the update function, my value will be the UUID of the record, since
>>>> the word itself is the key.
>>>> I'll ask my BloomFilter if I've seen this UUID before. If not, I increase
>>>> the count and also add the UUID to the filter.
>>>>
>>>> Does that make sense?
>>>>
>>>>
>>>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <
>>>> deshpandesh...@gmail.com> wrote:
>>>>
>>>>> Hi Burak,
>>>>> Thanks for the response. Can you please elaborate on your idea of
>>>>> storing the state of the unique ids.
>>>>> Do you have any sample code or links I can refer to?
>>>>> Thanks
>>>>>
>>>>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>>
>>>>>> Off the top of my head... (Each may have its own issues)
>>>>>>
>>>>>> If upstream you add a uniqueId to all your records, then you may use
>>>>>> a BloomFilter to approximate if you've seen a row before.
>>>>>> The problem I can see with that approach is how to repopulate the
>>>>>> bloom filter on restarts.
>>>>>>
>>>>>> If you are certain that you're not going to reprocess some data after
>>>>>> a certain time, i.e. the same data can only reappear within the last 2
>>>>>> hours and never after that, then you may also keep the state of the
>>>>>> uniqueIds, and then age them out after a certain time.
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> Burak
>>>>>>
>>>>>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>
>>>>>>> Please share your thoughts.....
>>>>>>>
>>>>>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> My streaming application stores a lot of aggregations using
>>>>>>>>> mapWithState.
>>>>>>>>>
>>>>>>>>> I want to know all the possible ways I can make it
>>>>>>>>> idempotent.
>>>>>>>>>
>>>>>>>>> Please share your views.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> In a word count application that stores the count of all the
>>>>>>>>>> words input so far using mapWithState, how do I make sure my
>>>>>>>>>> counts are not messed up if I happen to read a line more than once?
>>>>>>>>>>
>>>>>>>>>> Appreciate your response.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
