Spark Streaming does not guarantee exactly-once semantics for output actions, i.e. that each item is processed only once. You can achieve at-most-once or at-least-once.
You could, however, do at-least-once (via checkpointing) and record which messages
have been processed (is some identifier available?) so that you do not reprocess them ....
You could also store (safely) which offset range has already been processed etc
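
If you use the direct (receiver-less) Kafka stream, each batch exposes exactly which offset ranges it covers, so that bookkeeping can be done per range. A rough sketch; the broker address, topic, and the alreadyProcessed/markProcessed helpers are placeholders for whatever durable store you use:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object DedupByOffsets {
  // Placeholder helpers: back these with a store that remembers which
  // topic/partition offset ranges have already been written out.
  def alreadyProcessed(r: OffsetRange): Boolean = ???
  def markProcessed(r: OffsetRange): Unit = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dedup-by-offsets")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // placeholder
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events")) // placeholder topic

    stream.foreachRDD { rdd =>
      // The direct stream exposes exactly which offsets this batch covers.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // After a restart, replayed batches carry ranges seen before;
      // skip the batch if every range was already recorded as done.
      if (!ranges.forall(alreadyProcessed)) {
        val counts = rdd.map(_._2).countByValue() // count per message value
        // ... write `counts` out idempotently here ...
        ranges.foreach(markProcessed)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Writing the counts and the ranges in one transaction against the same store closes the remaining gap between output and bookkeeping.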

Think about the business case: is exactly-once really needed, or can it be
replaced by one of the others?
Exactly-once, if needed, requires more effort in any system, Spark included, and
usually the throughput is lower. A risk evaluation from a business point of
view has to be done anyway...

> On 22 Jun 2016, at 09:09, sandesh deshmane <sandesh.v...@gmail.com> wrote:
> 
> Hi,
> 
> I am writing spark streaming application which reads messages from Kafka.
> 
> I am using checkpointing and write-ahead logs (WAL) to achieve fault 
> tolerance.
> 
> I have set a batch size of 10 seconds for reading messages from Kafka.
> 
> I read messages from Kafka and generate counts of messages according to the 
> values received in the Kafka messages.
> 
> When there is a failure and my Spark Streaming application is restarted, I 
> see duplicate messages processed (close to 2 batches' worth).
> 
> The problem is that I get around 300k messages per second, and when the 
> application is restarted I see around 3-5 million duplicate counts.
> 
> How can I avoid such duplicates?
> 
> What is the best way to recover from such failures?
> 
> Thanks
> Sandesh
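
For reference, a minimal sketch of the checkpoint/WAL setup described above (the checkpoint path is a placeholder). On restart, getOrCreate recovers the context from the checkpoint, and the batches replayed during recovery are exactly where the duplicates come from:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CountingApp {
  val checkpointDir = "hdfs:///checkpoints/kafka-counts" // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("kafka-counts")
      // Enable the WAL for receiver-based streams, as described above.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches
    ssc.checkpoint(checkpointDir)
    // ... create the Kafka stream and the counting logic here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart this recovers from the checkpoint; batches that were
    // received but not fully processed are replayed, hence duplicates.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}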
