The direct stream doesn't automagically give you exactly-once semantics. Indeed, you should be pretty suspicious of anything that claims to give you end-to-end exactly-once semantics without any additional work on your part.
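One common form of that additional work is to write each batch's output and the Kafka offset range it covers in a single transaction, so a restarted job can tell whether a batch was already committed and skip the replay. Below is a minimal sketch of that idea in plain Python with SQLite, not Spark; the table layout and the `commit_batch` helper are hypothetical names chosen for illustration:

```python
# Sketch of atomically committing a batch result together with its
# Kafka offset range, so replayed batches are detected and skipped.
# Uses SQLite for the transactional store; table/column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE counts (topic TEXT, part INTEGER, "
    "until_offset INTEGER, count INTEGER)")
conn.execute(
    "CREATE TABLE offsets (topic TEXT, part INTEGER, "
    "until_offset INTEGER, PRIMARY KEY (topic, part))")

def commit_batch(topic, part, until_offset, count):
    """Record the batch result and advance the stored offset in ONE
    transaction. If the stored offset already covers this range, the
    batch was committed before the crash, so do nothing."""
    with conn:  # single transaction for both writes
        row = conn.execute(
            "SELECT until_offset FROM offsets WHERE topic=? AND part=?",
            (topic, part)).fetchone()
        if row is not None and row[0] >= until_offset:
            return False  # already committed; replay is a no-op
        conn.execute("INSERT INTO counts VALUES (?,?,?,?)",
                     (topic, part, until_offset, count))
        conn.execute(
            "INSERT INTO offsets VALUES (?,?,?) "
            "ON CONFLICT(topic, part) DO UPDATE SET "
            "until_offset=excluded.until_offset",
            (topic, part, until_offset))
        return True
```

Calling `commit_batch("t", 0, 100, 42)` twice stores the count once: the first call commits, the second sees the stored offset and returns without writing, which is exactly the restart scenario in the original question.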
To the original poster, have you read / watched the materials linked from
the page below? That should clarify what your options are.

https://github.com/koeninger/kafka-exactly-once

On Wed, Jun 22, 2016 at 5:55 AM, Denys Cherepanin <denusk...@gmail.com> wrote:

> Hi Sandesh,
>
> As I understand it, you are using the "receiver-based" approach to
> integrate Kafka with Spark Streaming.
>
> Have you tried the "direct" approach? In that case offsets are tracked
> by the streaming app via checkpointing, and you should achieve
> exactly-once semantics.
>
> On Wed, Jun 22, 2016 at 5:58 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> Spark Streaming does not guarantee exactly-once for output actions,
>> i.e. that one item is processed only once.
>> You can achieve at-most-once or at-least-once.
>> You could, however, do at-least-once (via checkpointing), record which
>> messages have already been processed (is some identifier available?),
>> and not reprocess them. You could also store (safely) which offset
>> range has already been processed, etc.
>>
>> Think about the business case: is exactly-once really needed, or can it
>> be replaced by one of the others?
>> Exactly-once, if needed, requires more effort in any system, including
>> Spark, and throughput is usually lower. A risk evaluation from a
>> business point of view has to be done anyway...
>>
>> > On 22 Jun 2016, at 09:09, sandesh deshmane <sandesh.v...@gmail.com>
>> > wrote:
>> >
>> > Hi,
>> >
>> > I am writing a Spark Streaming application which reads messages from
>> > Kafka.
>> >
>> > I am using checkpointing and write-ahead logs (WAL) to achieve fault
>> > tolerance.
>> >
>> > I have set a batch size of 10 seconds for reading messages from Kafka.
>> >
>> > I read messages from Kafka and generate a count of messages based on
>> > the values received in the Kafka messages.
>> >
>> > In case of a failure, when my Spark Streaming application is
>> > restarted, I see duplicate messages processed (close to 2 batches'
>> > worth).
>> >
>> > The problem is that I get around 300k messages per second, and if the
>> > application is restarted I see around 3-5 million duplicate counts.
>> >
>> > How can I avoid such duplicates?
>> >
>> > What is the best way to recover from such failures?
>> >
>> > Thanks
>> > Sandesh
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
> --
> Yours faithfully, Denys Cherepanin