We are going with checkpointing. We don't have an identifier available to tell whether a message has already been processed. Even if we had one, the lookup would slow down processing, since we receive around 300k messages per second.
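If we ever revisit deduplication, recording the Kafka offset range of each batch would avoid a per-message lookup entirely. Below is a minimal sketch of that idea (not something we have implemented), assuming the direct Kafka stream API; the broker address, topic, checkpoint path, and the saveBatch sink are hypothetical placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object OffsetRangeCounts {

  // hypothetical sink: real code would upsert into a store keyed by
  // (topic, partition, fromOffset, untilOffset) so a replayed batch
  // overwrites its own rows instead of adding to them
  def saveBatch(ranges: Array[OffsetRange], counts: Map[String, Long]): Unit =
    ranges.foreach(r => println(s"$r -> ${counts.size} distinct values"))

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("counts"), Seconds(10))
    ssc.checkpoint("/path/to/checkpoint")                            // hypothetical path

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")   // hypothetical broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                               // hypothetical topic

    stream.foreachRDD { rdd =>
      // offset ranges identify exactly which Kafka records this batch covers
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val counts = rdd.map(_._2).countByValue().toMap                // count per message value
      saveBatch(ranges, counts)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

On restart the direct stream replays the same offset ranges from the checkpoint, so a sink keyed by those ranges overwrites the earlier writes rather than double-counting, without any lookup on individual messages.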
Thanks
Sandesh

On Wed, Jun 22, 2016 at 3:28 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Spark Streaming does not guarantee exactly once for output actions. It
> means that one item is processed only once within an RDD.
> You can achieve at most once or at least once.
> You could however do at least once (via checkpointing) and record which
> messages have been processed (is some identifier available?) and not
> reprocess them. You could also store (safely) what range has already been
> processed, etc.
>
> Think about the business case: is exactly once needed, or can it be
> replaced by one of the others?
> Exactly once, if needed, requires more effort in any system, including
> Spark, and the throughput is usually lower. A risk evaluation from a
> business point of view has to be done anyway...
>
> > On 22 Jun 2016, at 09:09, sandesh deshmane <sandesh.v...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am writing a Spark Streaming application which reads messages from Kafka.
> >
> > I am using checkpointing and write ahead logs (WAL) to achieve fault
> > tolerance.
> >
> > I have created a batch size of 10 sec for reading messages from Kafka.
> >
> > I read messages from Kafka and generate a count of messages as per the
> > values received in the Kafka messages.
> >
> > In case of failure, when my Spark Streaming application is restarted, I
> > see duplicate messages processed (close to 2 batches' worth).
> >
> > The problem I have is that I get around 300k messages per sec, and in
> > case the application is restarted I see around 3-5 million duplicate counts.
> >
> > How to avoid such duplicates?
> >
> > What is the best way to recover from such failures?
> >
> > Thanks
> > Sandesh
>
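For reference, a minimal sketch of the checkpoint + WAL setup described in the original mail, assuming the receiver-based KafkaUtils.createStream; the ZooKeeper address, group id, topic, and checkpoint path are placeholders, and the printed count stands in for the real output.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object WalCounts {

  val checkpointDir = "hdfs:///tmp/counts-checkpoint"   // hypothetical path

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("counts")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // WAL for received data
    val ssc = new StreamingContext(conf, Seconds(10))                  // 10 second batches
    ssc.checkpoint(checkpointDir)

    // hypothetical ZooKeeper quorum, consumer group and topic
    val stream = KafkaUtils.createStream(ssc, "zk:2181", "counts-group", Map("events" -> 1))
    stream.map(_._2).countByValue().print()   // stand-in for the real counting output
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}

On restart, StreamingContext.getOrCreate rebuilds the context from the checkpoint and re-runs the batches that had not completed, replaying their data from the WAL. Because the counting output is not idempotent, those batches are counted a second time, which is consistent with the roughly two batches of duplicates reported above.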