Here I am referring to a failure in the Spark app. When I restart it, I see duplicate messages.
To replicate the scenario, I just kill my Spark app and then restart it.

On Wed, Jun 22, 2016 at 1:10 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> As I see it you are using Spark Streaming to read data from the source
> through Kafka. Your batch interval is 10 sec, so in that interval you have
> 10 * 300K = 3 million messages.
>
> When you say there is a failure, are you referring to a failure in the
> source or in the Spark Streaming app?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 22 June 2016 at 08:09, sandesh deshmane <sandesh.v...@gmail.com> wrote:
>
>> Hi,
>>
>> I am writing a Spark Streaming application which reads messages from
>> Kafka.
>>
>> I am using checkpointing and write-ahead logs (WAL) to achieve fault
>> tolerance.
>>
>> I have created a batch size of 10 sec for reading messages from Kafka.
>>
>> I read messages from Kafka and generate the count of messages as per the
>> values received in the Kafka messages.
>>
>> In case there is a failure and my Spark Streaming application is
>> restarted, I see duplicate messages processed (close to 2 batches' worth).
>>
>> The problem that I have is that per second I get around 300K messages,
>> and in case the application is restarted I see around 3-5 million
>> duplicate counts.
>>
>> How can I avoid such duplicates?
>>
>> What is the best way to recover from such failures?
>>
>> Thanks
>> Sandesh
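For context on the "how to avoid such duplicates" question: with checkpointing and the WAL, Spark Streaming gives at-least-once delivery, so replaying the last in-flight batch or two after a restart is expected behavior. The usual remedy is to make the output side idempotent, so a replayed batch overwrites its earlier result instead of adding to it. Below is a minimal sketch using the Kafka direct API of that era (Spark 1.6 / Kafka 0.8), keying each write by the batch's Kafka offset ranges; the broker address, topic name, checkpoint path, and the upsertCounts sink are placeholder assumptions, not details from the thread.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    object IdempotentCounts {
      val checkpointDir = "hdfs:///checkpoints/idempotent-counts"  // assumed path

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("idempotent-counts")
        val ssc  = new StreamingContext(conf, Seconds(10))       // 10 sec batches, as in the thread
        ssc.checkpoint(checkpointDir)

        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // assumed broker
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("mytopic"))                      // assumed topic

        stream.foreachRDD { rdd =>
          // The offset ranges identify exactly which Kafka records this batch
          // covers; a batch replayed after a restart carries the same ranges,
          // so it maps to the same key in the sink.
          val ranges  = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          val batchId = ranges.map(r =>
            s"${r.topic}-${r.partition}:${r.fromOffset}-${r.untilOffset}").mkString(",")
          val counts  = rdd.map(_._2).countByValue()             // message value -> count

          // Placeholder: write counts keyed by batchId into a store with
          // overwrite (upsert) semantics, so replay rewrites rather than adds.
          upsertCounts(batchId, counts)
        }
        ssc
      }

      // Hypothetical idempotent sink; real code would upsert into a DB or
      // similar store keyed by batchId.
      def upsertCounts(batchId: String, counts: scala.collection.Map[String, Long]): Unit = ()

      def main(args: Array[String]): Unit = {
        // getOrCreate rebuilds the context from the checkpoint after a crash,
        // which is exactly when the unfinished batches get replayed.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }

With this pattern the replayed batches after a kill/restart still run, but they overwrite the same rows they wrote before the crash, so the 3-5 million extra counts disappear from the final totals. Note also that the direct (receiverless) Kafka API does not need the WAL, since the data can be re-read from Kafka itself using the checkpointed offsets.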