We are going with checkpointing. We don't have an identifier available to
tell whether a message has already been processed.
Even if we had one, a per-message lookup would slow down processing, since
we receive about 300k messages per second.
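One way to avoid a per-message lookup at this rate (a sketch of the "store which range has already been processed" idea, not something from Spark itself; the class and method names here are illustrative) is to track only the highest processed Kafka offset per topic-partition. Deduplication then costs a single comparison per message instead of a lookup in a set of message IDs:

```python
# Illustrative sketch: per-partition offset tracking instead of per-message
# identifiers. State is one integer per (topic, partition), so memory and
# lookup cost do not grow with message volume.

class OffsetTracker:
    """Keeps the highest processed offset for each (topic, partition)."""

    def __init__(self):
        self.last_offset = {}  # (topic, partition) -> highest processed offset

    def is_duplicate(self, topic, partition, offset):
        # Anything at or below the recorded offset was already processed.
        return offset <= self.last_offset.get((topic, partition), -1)

    def mark_processed(self, topic, partition, offset):
        key = (topic, partition)
        if offset > self.last_offset.get(key, -1):
            self.last_offset[key] = offset

tracker = OffsetTracker()
tracker.mark_processed("events", 0, 41)
print(tracker.is_duplicate("events", 0, 40))  # True: already covered
print(tracker.is_duplicate("events", 0, 42))  # False: new message
```

For this to survive a restart, the tracker's state would of course have to be persisted atomically together with the output, which is the hard part in practice.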

Thanks
Sandesh

On Wed, Jun 22, 2016 at 3:28 PM, Jörn Franke <jornfra...@gmail.com> wrote:

>
> Spark Streaming does not guarantee exactly-once for output actions; the
> guarantee only means that an item is processed once within an RDD.
> You can achieve at-most-once or at-least-once.
> You could, however, do at-least-once (via checkpointing) and record which
> messages have been processed (is some identifier available?) so that you do
> not reprocess them... You could also store (safely) which ranges have
> already been processed, etc.
>
> Think about the business case: is exactly-once really needed, or can it be
> replaced by one of the others?
> Exactly-once, if needed, requires more effort in any system, including
> Spark, and throughput is usually lower. A risk evaluation from a business
> point of view has to be done anyway...
>
> > On 22 Jun 2016, at 09:09, sandesh deshmane <sandesh.v...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > I am writing spark streaming application which reads messages from Kafka.
> >
> > I am using checkpointing and write-ahead logs (WAL) to achieve fault
> > tolerance.
> >
> > I have set a batch interval of 10 seconds for reading messages from Kafka.
> >
> > I read messages from Kafka and generate counts based on the values
> > received in each Kafka message.
> >
> > If there is a failure and my Spark Streaming application is restarted,
> > I see duplicate messages processed (close to two batches' worth).
> >
> > The problem is that I receive around 300k messages per second, so when
> > the application is restarted I see around 3-5 million duplicate counts.
> >
> > How to avoid such duplicates?
> >
> > What is the best way to recover from such failures?
> >
> > Thanks
> > Sandesh
>
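Since the output here is a count per batch, one common alternative (not spelled out in the thread above, so this is an added illustration with an in-memory dict standing in for a real external store) is to make the output idempotent: key each write by its batch time and overwrite on replay, so a restarted application that reprocesses a batch rewrites the same entry instead of double-counting:

```python
# Illustrative sketch: idempotent per-batch output. A replayed batch after a
# restart overwrites the same key rather than adding to the total.

store = {}  # batch_time -> count; stand-in for a real key-value store

def commit_batch(batch_time, count):
    store[batch_time] = count  # overwrite: replaying this batch is harmless

commit_batch("2016-06-22 09:00:00", 3000000)
commit_batch("2016-06-22 09:00:00", 3000000)  # replay after restart
print(sum(store.values()))  # 3000000, not 6000000
```

This turns at-least-once processing into effectively-once results for the counts, without any per-message lookup.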
