The direct stream doesn't automagically give you exactly-once semantics. Indeed, you should be pretty suspicious of anything that claims to give you end-to-end exactly-once semantics without any additional work on your part.
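One common form of that additional work is to write each batch's output and the Kafka offset range it covers in a single transaction, so a restarted job can tell whether a batch was already committed and skip the replay. Below is a minimal sketch of that idea in plain Python with SQLite, not Spark; the table layout and the `commit_batch` helper are hypothetical names chosen for illustration:

```python
# Sketch of atomically committing a batch result together with its
# Kafka offset range, so replayed batches are detected and skipped.
# Uses SQLite for the transactional store; table/column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE counts (topic TEXT, part INTEGER, "
    "until_offset INTEGER, count INTEGER)")
conn.execute(
    "CREATE TABLE offsets (topic TEXT, part INTEGER, "
    "until_offset INTEGER, PRIMARY KEY (topic, part))")

def commit_batch(topic, part, until_offset, count):
    """Record the batch result and advance the stored offset in ONE
    transaction. If the stored offset already covers this range, the
    batch was committed before the crash, so do nothing."""
    with conn:  # single transaction for both writes
        row = conn.execute(
            "SELECT until_offset FROM offsets WHERE topic=? AND part=?",
            (topic, part)).fetchone()
        if row is not None and row[0] >= until_offset:
            return False  # already committed; replay is a no-op
        conn.execute("INSERT INTO counts VALUES (?,?,?,?)",
                     (topic, part, until_offset, count))
        conn.execute(
            "INSERT INTO offsets VALUES (?,?,?) "
            "ON CONFLICT(topic, part) DO UPDATE SET "
            "until_offset=excluded.until_offset",
            (topic, part, until_offset))
        return True
```

Calling `commit_batch("t", 0, 100, 42)` twice stores the count once: the first call commits, the second sees the stored offset and returns without writing, which is exactly the restart scenario in the original question.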
To the original poster, have you read / watched the materials linked from
the page below? That should clarify what your options are.

https://github.com/koeninger/kafka-exactly-once

On Wed, Jun 22, 2016 at 5:55 AM, Denys Cherepanin <denusk...@gmail.com> wrote:

> Hi Sandesh,
>
> As I understand it, you are using the "receiver-based" approach to
> integrate Kafka with Spark Streaming.
>
> Have you tried the "direct" approach? In that case offsets are tracked
> by the streaming app via checkpointing, and you should achieve
> exactly-once semantics.
>
> On Wed, Jun 22, 2016 at 5:58 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> Spark Streaming does not guarantee exactly-once for output actions,
>> i.e. that one item is processed only once.
>> You can achieve at-most-once or at-least-once.
>> You could, however, do at-least-once (via checkpointing), record which
>> messages have already been processed (is some identifier available?),
>> and not reprocess them. You could also store (safely) which offset
>> range has already been processed, etc.
>>
>> Think about the business case: is exactly-once really needed, or can it
>> be replaced by one of the others?
>> Exactly-once, if needed, requires more effort in any system, including
>> Spark, and throughput is usually lower. A risk evaluation from a
>> business point of view has to be done anyway...
>>
>> > On 22 Jun 2016, at 09:09, sandesh deshmane <sandesh.v...@gmail.com>
>> > wrote:
>> >
>> > Hi,
>> >
>> > I am writing a Spark Streaming application which reads messages from
>> > Kafka.
>> >
>> > I am using checkpointing and write-ahead logs (WAL) to achieve fault
>> > tolerance.
>> >
>> > I have set a batch size of 10 seconds for reading messages from Kafka.
>> >
>> > I read messages from Kafka and generate a count of messages based on
>> > the values received in the Kafka messages.
>> >
>> > In case of a failure, when my Spark Streaming application is
>> > restarted, I see duplicate messages processed (close to 2 batches'
>> > worth).
>> >
>> > The problem is that I get around 300k messages per second, and if the
>> > application is restarted I see around 3-5 million duplicate counts.
>> >
>> > How can I avoid such duplicates?
>> >
>> > What is the best way to recover from such failures?
>> >
>> > Thanks
>> > Sandesh
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
> --
> Yours faithfully, Denys Cherepanin