Hi,

I am writing a Spark Streaming application which reads messages from Kafka.

I am using checkpointing and write-ahead logs (WAL) to achieve fault
tolerance.

I have configured a batch interval of 10 seconds for reading messages from Kafka.

I read messages from Kafka and generate counts of messages per value
received in the Kafka messages.
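
To make the setup concrete, here is a minimal sketch of roughly how my application is wired, assuming the receiver-based Kafka integration; the checkpoint directory, ZooKeeper quorum, consumer group, and topic name below are placeholders, not my real values:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaCountApp {
  // Hypothetical checkpoint directory on a fault-tolerant store such as HDFS
  val checkpointDir = "hdfs:///checkpoints/kafka-count"

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("KafkaCountApp")
      // Write-ahead log so received data survives a driver/executor failure
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10)) // 10 second batches
    ssc.checkpoint(checkpointDir)

    // Receiver-based Kafka stream (hypothetical ZooKeeper quorum, group and topic)
    val messages = KafkaUtils.createStream(
      ssc,
      "zk1:2181",                      // ZooKeeper quorum
      "kafka-count-group",             // consumer group id
      Map("events" -> 1),              // topic -> number of receiver threads
      StorageLevel.MEMORY_AND_DISK_SER // serialized storage alongside the WAL
    )

    // Count messages per value in every 10 second batch
    messages.map { case (_, value) => (value, 1L) }
      .reduceByKey(_ + _)
      .print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, rebuild the context from the checkpoint if one exists
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}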

When there is a failure and my Spark Streaming application is restarted, I
see duplicate messages being processed (close to 2 batches' worth).

The problem is that I receive around 300k messages per second, so when the
application is restarted I see around 3-5 million duplicate counts.

How can I avoid such duplicates?

What is the best way to recover from such failures?

Thanks
Sandesh
