That is the cost of exactly once :) 

> On 22 Jun 2016, at 12:54, sandesh deshmane <sandesh.v...@gmail.com> wrote:
> 
> We are going with checkpointing. We don't have an identifier available to 
> tell whether a message has already been processed or not.
> Even if we had one, it would slow down the processing, since we get 300k 
> messages per sec, so the lookup would slow us down.
> 
> Thanks
> Sandesh
> 
>> On Wed, Jun 22, 2016 at 3:28 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> 
>> Spark Streaming does not guarantee exactly once for output actions; it only 
>> guarantees that each item is processed once within the RDD transformations.
>> You can achieve at most once or at least once.
>> You could, however, do at least once (via checkpointing) and record which 
>> messages have been processed (is some identifier available?) and not 
>> reprocess them... You could also store (safely) which ranges have already 
>> been processed, etc.
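>> 
>> A minimal sketch of the offset-range idea, assuming the direct Kafka stream 
>> from spark-streaming-kafka; alreadyProcessed, markProcessed and saveCounts 
>> are placeholder names for whatever external store and sink you use:
>> 
>> import kafka.serializer.StringDecoder
>> import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
>> 
>> // ssc, kafkaParams and topics assumed to be set up as usual
>> val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
>>   ssc, kafkaParams, topics)
>> 
>> stream.foreachRDD { rdd =>
>>   // one OffsetRange per Kafka partition in this batch
>>   val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>>   if (!alreadyProcessed(ranges)) {       // batches replayed after a restart are skipped
>>     val counts = rdd.map(_._2).countByValue()
>>     saveCounts(counts)                   // output action
>>     markProcessed(ranges)                // ideally in the same transaction as the output
>>   }
>> }
>> 
>> If the ranges and the counts go into the same transactional store, a replayed 
>> batch is simply recognised and skipped, without a per-message lookup.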
>> 
>> Think about the business case: is exactly once really needed, or can it be 
>> replaced by one of the others?
>> Exactly once, if needed, requires more effort in any system, including Spark, 
>> and usually the throughput is lower. A risk evaluation from a business point 
>> of view has to be done anyway...
>> 
>> > On 22 Jun 2016, at 09:09, sandesh deshmane <sandesh.v...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I am writing a Spark Streaming application which reads messages from Kafka.
>> >
>> > I am using checkpointing and write-ahead logs (WAL) to achieve fault 
>> > tolerance.
>> >
>> > I have set a batch size of 10 seconds for reading messages from Kafka.
>> >
>> > I read messages from Kafka and generate counts of messages per value 
>> > received in the Kafka messages.
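>> >
>> > Roughly, the setup looks like this (broker/ZooKeeper address, topic name, 
>> > group id and checkpoint path below are placeholders):
>> >
>> > import org.apache.spark.SparkConf
>> > import org.apache.spark.streaming.{Seconds, StreamingContext}
>> > import org.apache.spark.streaming.kafka.KafkaUtils
>> >
>> > val checkpointDir = "hdfs:///checkpoints/kafka-counts"
>> >
>> > def createContext(): StreamingContext = {
>> >   val conf = new SparkConf()
>> >     .setAppName("kafka-counts")
>> >     .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // WAL on
>> >   val ssc = new StreamingContext(conf, Seconds(10))                // 10 sec batches
>> >   ssc.checkpoint(checkpointDir)
>> >
>> >   val messages = KafkaUtils.createStream(ssc, "zk-host:2181", "counts-group",
>> >     Map("events" -> 1)).map(_._2)
>> >   // count messages per value in each batch and write them out
>> >   messages.countByValue().foreachRDD(rdd => rdd.foreach(println))  // real sink goes here
>> >   ssc
>> > }
>> >
>> > val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
>> > ssc.start()
>> > ssc.awaitTermination()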
>> >
>> > In case of a failure, when my Spark Streaming application is restarted, I 
>> > see duplicate messages processed (close to 2 batches' worth).
>> >
>> > The problem I have is that I get around 300k messages per sec, and when the 
>> > application is restarted I see around 3-5 million duplicate counts.
>> >
>> > How can I avoid such duplicates?
>> >
>> > What is the best way to recover from such failures?
>> >
>> > Thanks
>> > Sandesh
> 
