That is the cost of exactly once :)
> On 22 Jun 2016, at 12:54, sandesh deshmane <sandesh.v...@gmail.com> wrote:
>
> We are going with checkpointing. We don't have an identifier available to
> tell whether a message has already been processed.
> Even if we had one, it would slow down processing: we receive around 300k
> messages per second, so the lookup would become a bottleneck.
>
> Thanks
> Sandesh
>
>> On Wed, Jun 22, 2016 at 3:28 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> Spark Streaming does not guarantee exactly-once for output actions. It means
>> that one item is only processed once within an RDD.
>> You can achieve at-most-once or at-least-once.
>> You could, however, do at-least-once (via checkpointing) and record which
>> messages have been processed (is some identifier available?) and not
>> reprocess them... You could also store (safely) which range has already been
>> processed, etc.
>>
>> Think about the business case: is exactly-once really needed, or can it be
>> replaced by one of the others?
>> Exactly-once, if needed, requires more effort in any system, including
>> Spark, and usually the throughput is lower. A risk evaluation from a
>> business point of view has to be done anyway...
>>
>> > On 22 Jun 2016, at 09:09, sandesh deshmane <sandesh.v...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I am writing a Spark Streaming application which reads messages from Kafka.
>> >
>> > I am using checkpointing and write-ahead logs (WAL) to achieve fault
>> > tolerance.
>> >
>> > I have set a batch size of 10 seconds for reading messages from Kafka.
>> >
>> > I read messages from Kafka and generate counts of messages based on the
>> > values received in the Kafka messages.
>> >
>> > If there is a failure and my Spark Streaming application is restarted, I
>> > see duplicate messages processed (close to 2 batches' worth).
>> >
>> > The problem I have is that I receive around 300k messages per second, so
>> > if the application is restarted I see around 3-5 million duplicate counts.
>> >
>> > How can I avoid such duplicates?
>> >
>> > What is the best way to recover from such failures?
>> >
>> > Thanks
>> > Sandesh
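The suggestion in the thread to "store (safely) what range has been already processed" can be sketched as a per-partition offset tracker: before counting a replayed batch, drop every record at or below the last committed offset. This is a minimal plain-Python illustration, not Spark or Kafka API code; `OffsetTracker` and its method names are invented for this sketch, and a production version would keep the committed offsets in a transactional store together with the counts, so the commit and the output happen atomically.

```python
class OffsetTracker:
    """Tracks, per partition, the highest offset whose batch was fully
    processed and whose output was committed (illustrative sketch)."""

    def __init__(self):
        self.committed = {}  # partition -> last committed offset

    def filter_new(self, partition, records):
        """Drop records at or below the committed offset, so a batch
        replayed after a restart is not double-counted."""
        last = self.committed.get(partition, -1)
        return [(off, val) for off, val in records if off > last]

    def commit(self, partition, upto_offset):
        """Record progress only after the output action has succeeded."""
        self.committed[partition] = max(
            self.committed.get(partition, -1), upto_offset)


# Simulate a restart that replays the last batch:
tracker = OffsetTracker()
batch = [(0, "a"), (1, "b"), (2, "c")]

processed = tracker.filter_new(0, batch)  # first run: all 3 records are new
tracker.commit(0, 2)                      # output written, offset 2 committed

replayed = tracker.filter_new(0, batch)   # after restart: same batch replayed
print(len(processed), len(replayed))      # prints: 3 0
```

The filter is cheap (one lookup per partition, not per message), so it avoids the per-message identifier lookup that the thread worries would be too slow at 300k messages/sec; the hard part in practice is making the commit of the offset and the write of the counts a single atomic step.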