Here I am referring to a failure in the Spark app. When I restart it, I see duplicate messages.
To replicate the scenario, I just kill my Spark app and then restart it.

On Wed, Jun 22, 2016 at 1:10 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> As I see it you are using Spark Streaming to read data from the source
> through Kafka. Your batch interval is 10 sec, so in that interval you have
> 10 * 300K = 3 million messages.
>
> When you say there is a failure, are you referring to a failure in the
> source or in the Spark Streaming app?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 22 June 2016 at 08:09, sandesh deshmane <sandesh.v...@gmail.com> wrote:
>
>> Hi,
>>
>> I am writing a Spark Streaming application which reads messages from
>> Kafka.
>>
>> I am using checkpointing and write-ahead logs (WAL) to achieve fault
>> tolerance.
>>
>> I have created a batch size of 10 sec for reading messages from Kafka.
>>
>> I read messages from Kafka and generate the count of messages as per the
>> values received in the Kafka messages.
>>
>> In case there is a failure and my Spark Streaming application is
>> restarted, I see duplicate messages processed (close to 2 batches' worth).
>>
>> The problem that I have is that per second I get around 300K messages,
>> and in case the application is restarted I see around 3-5 million
>> duplicate counts.
>>
>> How can I avoid such duplicates?
>>
>> What is the best way to recover from such failures?
>>
>> Thanks
>> Sandesh
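For context on the "how to avoid such duplicates" question: with checkpointing and the WAL, Spark Streaming gives at-least-once delivery, so replaying the last in-flight batch or two after a restart is expected behavior. The usual remedy is to make the output side idempotent, so a replayed batch overwrites its earlier result instead of adding to it. Below is a minimal sketch using the Kafka direct API of that era (Spark 1.6 / Kafka 0.8), keying each write by the batch's Kafka offset ranges; the broker address, topic name, checkpoint path, and the upsertCounts sink are placeholder assumptions, not details from the thread.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    object IdempotentCounts {
      val checkpointDir = "hdfs:///checkpoints/idempotent-counts"  // assumed path

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("idempotent-counts")
        val ssc  = new StreamingContext(conf, Seconds(10))       // 10 sec batches, as in the thread
        ssc.checkpoint(checkpointDir)

        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // assumed broker
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("mytopic"))                      // assumed topic

        stream.foreachRDD { rdd =>
          // The offset ranges identify exactly which Kafka records this batch
          // covers; a batch replayed after a restart carries the same ranges,
          // so it maps to the same key in the sink.
          val ranges  = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          val batchId = ranges.map(r =>
            s"${r.topic}-${r.partition}:${r.fromOffset}-${r.untilOffset}").mkString(",")
          val counts  = rdd.map(_._2).countByValue()             // message value -> count

          // Placeholder: write counts keyed by batchId into a store with
          // overwrite (upsert) semantics, so replay rewrites rather than adds.
          upsertCounts(batchId, counts)
        }
        ssc
      }

      // Hypothetical idempotent sink; real code would upsert into a DB or
      // similar store keyed by batchId.
      def upsertCounts(batchId: String, counts: scala.collection.Map[String, Long]): Unit = ()

      def main(args: Array[String]): Unit = {
        // getOrCreate rebuilds the context from the checkpoint after a crash,
        // which is exactly when the unfinished batches get replayed.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }

With this pattern the replayed batches after a kill/restart still run, but they overwrite the same rows they wrote before the crash, so the 3-5 million extra counts disappear from the final totals. Note also that the direct (receiverless) Kafka API does not need the WAL, since the data can be re-read from Kafka itself using the checkpointed offsets.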