Hi Sandesh, as I understand it, you are using the "receiver-based" approach to integrate Kafka with Spark Streaming.
Have you tried the "direct" approach <http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers>? In that case offsets are tracked by the streaming app itself via checkpointing, and you should achieve exactly-once semantics.

On Wed, Jun 22, 2016 at 5:58 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Spark Streaming does not guarantee exactly-once for output actions. It means
> that one item is only processed once in an RDD.
> You can achieve at-most-once or at-least-once.
> You could however do at-least-once (via checkpointing) and record which
> messages have been processed (is some identifier available?) and not
> reprocess them.... You could also store (safely) what range has already been
> processed, etc.
>
> Think about the business case: is exactly-once needed, or can it be
> replaced by one of the others?
> Exactly-once, if needed, requires more effort in any system, including Spark,
> and usually the throughput is lower. A risk evaluation from a business
> point of view has to be done anyway...
>
> > On 22 Jun 2016, at 09:09, sandesh deshmane <sandesh.v...@gmail.com>
> > wrote:
> >
> > Hi,
> >
> > I am writing a Spark Streaming application which reads messages from Kafka.
> >
> > I am using checkpointing and write-ahead logs (WAL) to achieve fault
> > tolerance.
> >
> > I have created a batch size of 10 sec for reading messages from Kafka.
> >
> > I read messages from Kafka and generate the count of messages as per the
> > values received in the Kafka messages.
> >
> > In case there is a failure and my Spark Streaming application is restarted,
> > I see duplicate messages processed (close to 2 batches' worth).
> >
> > The problem that I have is that I get around 300k messages per second, and
> > in case the application is restarted I see around 3-5 million duplicate counts.
> >
> > How to avoid such duplicates?
> >
> > What is the best way to recover from such failures?
> >
> > Thanks
> > Sandesh
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

--
Yours faithfully,
Denys Cherepanin
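P.S. For reference, a minimal sketch of the direct approach in PySpark, assuming the Spark 1.x-era `spark-streaming-kafka` package. The checkpoint path, topic name, and broker address are placeholders, not anything from this thread:

```python
# Sketch of the Kafka "direct" approach (no receivers). Offsets are tracked
# by the streaming app in the checkpoint, not by a receiver + WAL, so each
# record is received exactly once. All paths/names below are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

CHECKPOINT = "hdfs:///checkpoints/direct-kafka"  # placeholder path

def create_context():
    sc = SparkContext(appName="DirectKafkaCounts")
    ssc = StreamingContext(sc, 10)  # 10-second batches, as in the question
    ssc.checkpoint(CHECKPOINT)

    stream = KafkaUtils.createDirectStream(
        ssc,
        ["events"],                                # placeholder topic
        {"metadata.broker.list": "broker1:9092"})  # placeholder broker

    stream.count().pprint()
    return ssc

# getOrCreate rebuilds the context from the checkpoint after a restart,
# resuming from the last recorded offsets instead of replaying batches.
ssc = StreamingContext.getOrCreate(CHECKPOINT, create_context)
ssc.start()
ssc.awaitTermination()
```

Note that exactly-once still depends on the output action being idempotent or transactional; the direct approach only removes the receiver-side duplication.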
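P.P.S. Jörn's suggestion of recording which ranges have already been processed can be sketched in plain Python. Here a dict stands in for a durable store (ZooKeeper, HBase, an RDBMS); all names are illustrative, not Spark API:

```python
# Sketch: at-least-once delivery made effectively exactly-once by recording
# the highest processed offset per partition in a (notionally durable) store.
# The dict stands in for ZooKeeper/HBase/an RDBMS; all names are illustrative.

processed = {}  # partition -> highest offset already counted

def process_batch(store, records):
    """records: iterable of (partition, offset, value) tuples.
    Counts only records above the stored offset, so a replayed
    batch after a restart adds nothing to the count."""
    count = 0
    for partition, offset, value in records:
        if offset <= store.get(partition, -1):
            continue  # duplicate from a replayed batch; skip it
        count += 1
        store[partition] = offset  # commit progress after processing
    return count

batch = [(0, 1, "a"), (0, 2, "b"), (1, 1, "c")]
first = process_batch(processed, batch)   # first delivery: all 3 are new
replay = process_batch(processed, batch)  # replay after restart: 0 counted
```

In a real job the offset commit and the count update would have to be stored atomically (same transaction), otherwise a crash between the two reintroduces the duplicate window.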