As I see it, you are using Spark Streaming to read data from the source through Kafka. Your batch interval is 10 seconds, so each batch holds roughly 10 * 300K = 3 million messages.
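For the avoidance of doubt, a minimal sketch of that kind of setup (assuming the receiver-based KafkaUtils.createStream API; the ZooKeeper quorum, consumer group, and topic name are placeholders) would look like this:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaMessageCounter")
    // 10-second batch interval, i.e. roughly 3 million messages per batch
    val ssc = new StreamingContext(conf, Seconds(10))

    val messages = KafkaUtils.createStream(
      ssc,
      "zkhost:2181",      // placeholder ZooKeeper quorum
      "counter-group",    // placeholder consumer group id
      Map("events" -> 1)) // placeholder topic -> number of receiver threads

    // count of messages received in each 10-second batch
    messages.count().print()

    ssc.start()
    ssc.awaitTermination()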
When you say there is a failure, are you referring to a failure in the source or in the Spark Streaming app?

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 22 June 2016 at 08:09, sandesh deshmane <sandesh.v...@gmail.com> wrote:

> Hi,
>
> I am writing a Spark Streaming application which reads messages from Kafka.
>
> I am using checkpointing and write-ahead logs (WAL) to achieve fault
> tolerance.
>
> I have set a batch interval of 10 seconds for reading messages from Kafka.
>
> I read messages from Kafka and generate a count of messages based on the
> values received in each Kafka message.
>
> When there is a failure and my Spark Streaming application is restarted, I
> see duplicate messages processed (close to 2 batches' worth).
>
> The problem I have is that I get around 300K messages per second, so when
> the application is restarted I see around 3-5 million duplicate counts.
>
> How can I avoid such duplicates?
>
> What is the best way to recover from such failures?
>
> Thanks
> Sandesh
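For reference, the checkpoint-plus-WAL recovery path described above typically looks like the following minimal sketch (again assuming the receiver-based API; the checkpoint path and Kafka parameters are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("KafkaMessageCounter")
        // write received blocks to the WAL before acknowledging them
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir) // metadata checkpointing for driver recovery
      val messages = KafkaUtils.createStream(
        ssc, "zkhost:2181", "counter-group", Map("events" -> 1))
      messages.count().print()
      ssc
    }

    // on restart, rebuild the context from the checkpoint rather than
    // creating a fresh one; WAL blocks that were received but not yet fully
    // processed are replayed and reprocessed, hence the duplicates observed
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()

Note that this design gives at-least-once delivery: data written to the WAL is not lost, but the in-flight batches around the failure point are replayed on restart, which matches the roughly 2 batches of duplicates you describe. Eliminating duplicates entirely generally requires either idempotent output operations or tracking Kafka offsets yourself with the direct (receiverless) stream.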