As I see it, you are using Spark Streaming to read data from the source through Kafka. Your batch interval is 10 seconds, so each batch holds roughly 10 * 300K = 3 million messages.
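For the avoidance of doubt, a minimal sketch of that kind of setup (assuming the receiver-based KafkaUtils.createStream API; the ZooKeeper quorum, consumer group, and topic name are placeholders) would look like this:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaMessageCounter")
    // 10-second batch interval, i.e. roughly 3 million messages per batch
    val ssc = new StreamingContext(conf, Seconds(10))

    val messages = KafkaUtils.createStream(
      ssc,
      "zkhost:2181",      // placeholder ZooKeeper quorum
      "counter-group",    // placeholder consumer group id
      Map("events" -> 1)) // placeholder topic -> number of receiver threads

    // count of messages received in each 10-second batch
    messages.count().print()

    ssc.start()
    ssc.awaitTermination()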
When you say there is a failure, are you referring to a failure in the source or in the Spark Streaming app?

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 22 June 2016 at 08:09, sandesh deshmane <sandesh.v...@gmail.com> wrote:

> Hi,
>
> I am writing a Spark Streaming application which reads messages from Kafka.
>
> I am using checkpointing and write-ahead logs (WAL) to achieve fault
> tolerance.
>
> I have set a batch interval of 10 seconds for reading messages from Kafka.
>
> I read messages from Kafka and generate a count of messages based on the
> values received in each Kafka message.
>
> When there is a failure and my Spark Streaming application is restarted, I
> see duplicate messages processed (close to 2 batches' worth).
>
> The problem I have is that I get around 300K messages per second, so when
> the application is restarted I see around 3-5 million duplicate counts.
>
> How can I avoid such duplicates?
>
> What is the best way to recover from such failures?
>
> Thanks
> Sandesh
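For reference, the checkpoint-plus-WAL recovery path described above typically looks like the following minimal sketch (again assuming the receiver-based API; the checkpoint path and Kafka parameters are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("KafkaMessageCounter")
        // write received blocks to the WAL before acknowledging them
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir) // metadata checkpointing for driver recovery
      val messages = KafkaUtils.createStream(
        ssc, "zkhost:2181", "counter-group", Map("events" -> 1))
      messages.count().print()
      ssc
    }

    // on restart, rebuild the context from the checkpoint rather than
    // creating a fresh one; WAL blocks that were received but not yet fully
    // processed are replayed and reprocessed, hence the duplicates observed
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()

Note that this design gives at-least-once delivery: data written to the WAL is not lost, but the in-flight batches around the failure point are replayed on restart, which matches the roughly 2 batches of duplicates you describe. Eliminating duplicates entirely generally requires either idempotent output operations or tracking Kafka offsets yourself with the direct (receiverless) stream.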