Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Jörn Franke

That is the cost of exactly once :) 

> On 22 Jun 2016, at 12:54, sandesh deshmane  wrote:
> 
> We are going with checkpointing. We don't have an identifier available to 
> tell whether a message has already been processed.
> Even if we had one, it would slow down processing, as we get 300k 
> messages per sec, so the lookup would slow things down.
> 
> Thanks
> Sandesh


Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Cody Koeninger
The direct stream doesn't automagically give you exactly-once
semantics.  Indeed, you should be pretty suspicious of anything that
claims to give you end-to-end exactly-once semantics without any
additional work on your part.

To the original poster, have you read / watched the materials linked
from the page below?  That should clarify what your options are.

https://github.com/koeninger/kafka-exactly-once
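For readers skimming the thread, the core pattern from those materials is sketched below in Scala against the Kafka 0.8 direct API: take the offset ranges off each batch's RDD and persist them in the same transaction as the results. The broker address, topic name and the saveAtomically sink are placeholders for illustration, not part of any Spark or Kafka API.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectStreamSketch {
  // Hypothetical sink: write the batch results and the consumed offsets in ONE transaction,
  // so that after a restart either both are visible or neither is.
  def saveAtomically(counts: Map[String, Long], offsets: Array[OffsetRange]): Unit = ???

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("direct-sketch"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                                // placeholder topic

    stream.foreachRDD { rdd =>
      // The direct stream exposes exactly which offsets this batch covers.
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val counts  = rdd.map(_._2).countByValue().toMap                // per-value counts for the batch
      saveAtomically(counts, offsets)                                 // commit results + offsets together
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```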

On Wed, Jun 22, 2016 at 5:55 AM, Denys Cherepanin  wrote:
> Hi Sandesh,
>
> As I understand, you are using the "receiver based" approach to integrate Kafka
> with Spark Streaming.
>
> Have you tried the "direct" approach? In this case offsets will be tracked by the
> streaming app via checkpointing, and you should achieve exactly-once
> semantics
>
> --
> Yours faithfully, Denys Cherepanin




Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
We have not tried the direct approach. We are using the receiver-based approach
(we use ZooKeeper to connect from Spark).

We have around 20+ Kafka brokers, and sometimes we replace them (they
go down). So each time I need to change the broker list in the Spark application
and restart the streaming app.

Thanks
Sandesh
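For context, the receiver-based setup described above looks roughly like the sketch below (ZooKeeper quorum, group id, topic map and checkpoint directory are placeholders). With a receiver, enabling the write-ahead log gives at-least-once delivery after a restart, which is consistent with the duplicates being observed.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("receiver-based-sketch")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // WAL for received data
val ssc = new StreamingContext(conf, Seconds(10))                // 10 sec batch interval
ssc.checkpoint("hdfs:///checkpoints/receiver-stream")            // placeholder checkpoint dir

// The receiver connects through ZooKeeper, not through a broker list.
val stream = KafkaUtils.createStream(
  ssc,
  "zk1:2181,zk2:2181",                // placeholder ZooKeeper quorum
  "spark-consumer-group",             // placeholder consumer group id
  Map("events" -> 4),                 // placeholder topic -> receiver thread count
  StorageLevel.MEMORY_AND_DISK_SER)   // serialized storage, recommended with the WAL

stream.map(_._2).count().print()      // stand-in for the real counting job
ssc.start()
ssc.awaitTermination()
```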

On Wed, Jun 22, 2016 at 4:25 PM, Denys Cherepanin 
wrote:

> Hi Sandesh,
>
> As I understand, you are using the "receiver based" approach to integrate Kafka
> with Spark Streaming.
>
> Have you tried the "direct" approach? In this case offsets will be tracked by the
> streaming app via checkpointing, and you should achieve exactly-once semantics
>
> --
> Yours faithfully, Denys Cherepanin
>


Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Denys Cherepanin
Hi Sandesh,

As I understand, you are using the "receiver based" approach to integrate Kafka
with Spark Streaming.

Have you tried the "direct" approach? In this case offsets will be tracked by the
streaming app via checkpointing, and you should achieve exactly-once semantics.
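As a rough sketch of that suggestion, a direct stream combined with checkpointing might look like the following; the checkpoint directory, broker list and topic are placeholders. Note (as Cody points out elsewhere in the thread) that this restores the consumed offsets after a restart, but the output action still has to be idempotent or transactional to get genuinely exactly-once results.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/direct-stream"   // placeholder

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("direct-checkpoint"), Seconds(10))
  ssc.checkpoint(checkpointDir)
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // placeholder brokers
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))                                            // placeholder topic
  stream.map(_._2).count().print()   // stand-in for the real counting / output action
  ssc
}

// On a clean start this builds a new context; after a failure it is rebuilt from the
// checkpoint, including the Kafka offsets the direct stream had reached.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```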

On Wed, Jun 22, 2016 at 5:58 AM, Jörn Franke  wrote:

>
> Spark Streaming does not guarantee exactly once for the output action, i.e. that
> one item is processed only once in an RDD.
> You can achieve at most once or at least once.
> You could, however, do at least once (via checkpointing) and record which
> messages have been processed (is some identifier available?) so you do not
> re-process them. You could also store (safely) what range has already been
> processed, etc.
>
> Think about the business case: is exactly once needed, or can it be
> replaced by one of the others?
> Exactly once, if needed, requires more effort in any system, including Spark,
> and usually the throughput is lower. A risk evaluation from a business
> point of view has to be done anyway...
>


-- 
Yours faithfully, Denys Cherepanin


Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
We are going with checkpointing. We don't have an identifier available to
tell whether a message has already been processed.
Even if we had one, it would slow down processing, as we get 300k
messages per sec, so the lookup would slow things down.

Thanks
Sandesh

On Wed, Jun 22, 2016 at 3:28 PM, Jörn Franke  wrote:

>
> Spark Streaming does not guarantee exactly once for the output action, i.e. that
> one item is processed only once in an RDD.
> You can achieve at most once or at least once.
> You could, however, do at least once (via checkpointing) and record which
> messages have been processed (is some identifier available?) so you do not
> re-process them. You could also store (safely) what range has already been
> processed, etc.
>
> Think about the business case: is exactly once needed, or can it be
> replaced by one of the others?
> Exactly once, if needed, requires more effort in any system, including Spark,
> and usually the throughput is lower. A risk evaluation from a business
> point of view has to be done anyway...
>


Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Jörn Franke

Spark Streaming does not guarantee exactly once for the output action, i.e. that one
item is processed only once in an RDD.
You can achieve at most once or at least once.
You could, however, do at least once (via checkpointing) and record which messages
have been processed (is some identifier available?) so you do not re-process them.
You could also store (safely) what range has already been processed, etc.

Think about the business case: is exactly once needed, or can it be
replaced by one of the others?
Exactly once, if needed, requires more effort in any system, including Spark, and
usually the throughput is lower. A risk evaluation from a business point of
view has to be done anyway...
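To make the "record which messages have been processed" idea concrete, here is a rough sketch assuming such an identifier existed (the original poster notes elsewhere in the thread that his messages carry none). The alreadyProcessed / markProcessed calls stand for a hypothetical external store and are not a Spark or Kafka API.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical lookups against an external store of processed message ids.
// Both would have to be usable from executor code (e.g. one connection per partition).
def alreadyProcessed(id: String): Boolean = ???
def markProcessed(id: String): Unit = ???

// Called from foreachRDD: skip messages whose id has been seen, process the rest,
// then record the new ids so a replayed batch is skipped next time.
def processOnce(batch: RDD[(String, String)]): Unit = {
  batch.foreachPartition { records =>
    records.foreach { case (id, value) =>
      if (!alreadyProcessed(id)) {
        // ... apply the real output action for `value` here ...
        markProcessed(id)
      }
    }
  }
}
```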

> On 22 Jun 2016, at 09:09, sandesh deshmane  wrote:
> 
> Hi,
> 
> I am writing a Spark Streaming application which reads messages from Kafka.
> 
> I am using checkpointing and write-ahead logs (WAL) to achieve fault 
> tolerance.
> 
> I have set a batch interval of 10 sec for reading messages from Kafka.
> 
> I read messages from Kafka and generate counts of messages based on the values 
> received in the Kafka messages.
> 
> In case there is a failure and my Spark Streaming application is restarted, I 
> see duplicate messages processed (close to 2 batches' worth).
> 
> The problem I have is that I get around 300k messages per sec, and in case the 
> application is restarted I see around 3-5 million duplicate counts.
> 
> How do I avoid such duplicates?
> 
> What is the best way to recover from such failures?
> 
> Thanks
> Sandesh




Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
Hi Sandesh,

Where do these messages end up? Are they written to a sink (file, database, etc.)?

What is the reason your app fails? Can that be remedied to reduce the
impact?

How do you identify that duplicates are sent and processed?

Cheers,

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 22 June 2016 at 10:38, sandesh deshmane  wrote:

> Mich Talebzadeh, thanks for the reply.
>
> We have a retention policy of 4 hours for Kafka messages, and we have
> multiple other consumers which read from the Kafka cluster (Spark is one of
> them).
>
> We have a timestamp in the message, but we actually have multiple messages with
> the same timestamp, so it is very hard to differentiate them.
>
> But we do have offsets with Kafka; the consumer keeps track of the offset
> and should read from that offset.
>
> So is it a problem with Kafka rather than with Spark?
>
> Is there any way to rectify this?
> We don't have an ID in the messages. Even if we added one, we would keep a map
> where we put the IDs and mark them processed, but at run time I would need to do
> that lookup, and for us the number of messages is very high, so the lookup
> would add to the processing time.
>
> Thanks
> Sandesh Deshmane
>


Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
Mich Talebzadeh, thanks for the reply.

We have a retention policy of 4 hours for Kafka messages, and we have multiple
other consumers which read from the Kafka cluster (Spark is one of them).

We have a timestamp in the message, but we actually have multiple messages with
the same timestamp, so it is very hard to differentiate them.

But we do have offsets with Kafka; the consumer keeps track of the offset and
should read from that offset.

So is it a problem with Kafka rather than with Spark?

Is there any way to rectify this?
We don't have an ID in the messages. Even if we added one, we would keep a map
where we put the IDs and mark them processed, but at run time I would need to do
that lookup, and for us the number of messages is very high, so the lookup
would add to the processing time.

Thanks
Sandesh Deshmane
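Since offsets are already tracked per partition, one way to avoid a per-message lookup entirely, in the spirit of the "store what range has been processed" suggestion earlier in the thread, is to key the stored counts by (topic, partition, fromOffset): a replayed batch then overwrites the same row instead of being counted twice. This requires the direct approach discussed elsewhere in the thread; the upsertBatchCount sink below is hypothetical, and the sketch counts whole partitions rather than per-value counts for brevity.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Hypothetical sink: upsert keyed by (topic, partition, fromOffset). Re-running the same
// batch after a restart overwrites the existing row rather than adding a duplicate count.
def upsertBatchCount(range: OffsetRange, count: Long): Unit = ???

// Called from foreachRDD on a direct Kafka stream. Partition i of the RDD corresponds to
// offsetRanges(i), so each partition's count can be attributed to its offset range with
// no per-message lookup at all.
def writeCounts(rdd: RDD[(String, String)]): Unit = {
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val counts = rdd
    .mapPartitionsWithIndex((i, it) => Iterator((i, it.size.toLong)))
    .collectAsMap()
  ranges.zipWithIndex.foreach { case (r, i) =>
    upsertBatchCount(r, counts.getOrElse(i, 0L))
  }
}
```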

On Wed, Jun 22, 2016 at 2:36 PM, Mich Talebzadeh 
wrote:

> Yes, this is more of a Kafka issue, as Kafka sends the messages again.
>
> In your topic, do messages come with an ID or timestamp so you can
> reject them if they have already been processed?
>
> In other words, do you have a way to tell which message was last processed via Spark
> before failing?
>
> You can of course reset the Kafka retention time and purge it:
>
>
> ${KAFKA_HOME}/bin/kafka-topics.sh --zookeeper rhes564:2181 --alter --topic
> newtopic --config retention.ms=1000
>
>
>
> Wait for a minute and then reset back
>
>
>
> ${KAFKA_HOME}/bin/kafka-topics.sh --zookeeper rhes564:2181 --alter --topic
> newtopic --config retention.ms=60
>
>
>
> ${KAFKA_HOME}/bin/kafka-console-consumer.sh --zookeeper rhes564:2181
> --from-beginning --topic newtopic
>
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>


Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
Yes, this is more of a Kafka issue, as Kafka sends the messages again.

In your topic, do messages come with an ID or timestamp so you can reject
them if they have already been processed?

In other words, do you have a way to tell which message was last processed via Spark
before failing?

You can of course reset the Kafka retention time and purge it:


${KAFKA_HOME}/bin/kafka-topics.sh --zookeeper rhes564:2181 --alter --topic
newtopic --config retention.ms=1000



Wait for a minute and then reset back



${KAFKA_HOME}/bin/kafka-topics.sh --zookeeper rhes564:2181 --alter --topic
newtopic --config retention.ms=60



${KAFKA_HOME}/bin/kafka-console-consumer.sh --zookeeper rhes564:2181
--from-beginning --topic newtopic



HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 22 June 2016 at 09:57, sandesh deshmane  wrote:

> Here I refer to a failure in the Spark app.
>
> So when I restart, I see duplicate messages.
>
> To replicate the scenario, I just kill my Spark app and then restart it.
>


Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
Here I refer to a failure in the Spark app.

So when I restart, I see duplicate messages.

To replicate the scenario, I just kill my Spark app and then restart it.

On Wed, Jun 22, 2016 at 1:10 PM, Mich Talebzadeh 
wrote:

> As I see it, you are using Spark Streaming to read data from the source through
> Kafka. Your batch interval is 10 sec, so in that interval you have 10 * 300K
> = 3 million messages.
>
> When you say there is a failure, are you referring to a failure in the
> source or in the Spark Streaming app?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>


Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
As I see it, you are using Spark Streaming to read data from the source through
Kafka. Your batch interval is 10 sec, so in that interval you have 10 * 300K
= 3 million messages.

When you say there is a failure, are you referring to a failure in the
source or in the Spark Streaming app?

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 22 June 2016 at 08:09, sandesh deshmane  wrote:

> Hi,
>
> I am writing a Spark Streaming application which reads messages from Kafka.
>
> I am using checkpointing and write-ahead logs (WAL) to achieve fault
> tolerance.
>
> I have set a batch interval of 10 sec for reading messages from Kafka.
>
> I read messages from Kafka and generate counts of messages based on the values
> received in the Kafka messages.
>
> In case there is a failure and my Spark Streaming application is restarted, I
> see duplicate messages processed (close to 2 batches' worth).
>
> The problem I have is that I get around 300k messages per sec, and in case the
> application is restarted I see around 3-5 million duplicate counts.
>
> How do I avoid such duplicates?
>
> What is the best way to recover from such failures?
>
> Thanks
> Sandesh
>