We have not tried the direct approach. We are using the receiver-based approach (we use ZooKeeper to connect from Spark).
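Our setup looks roughly like this (a minimal sketch; the ZooKeeper quorum, group id, topic, and checkpoint path below are placeholders, not our real values):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf()
      .setAppName("KafkaCounts")
      // WAL for receiver fault tolerance, as described below
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-sec batches
    ssc.checkpoint("hdfs:///checkpoints/kafka-counts")

    // Receiver-based stream: the Kafka high-level consumer tracks offsets
    // in ZooKeeper, separately from Spark's checkpoint/WAL.
    val messages = KafkaUtils.createStream(
      ssc,
      "zk1:2181,zk2:2181,zk3:2181", // ZooKeeper quorum
      "counting-app",               // consumer group id
      Map("events" -> 4))           // topic -> receiver thread count

    messages.map(_._2).countByValue().print() // counts per message value
    ssc.start()
    ssc.awaitTermination()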
We have around 20+ Kafka brokers, and sometimes we replace brokers (they go down). Each time that happens I need to change the broker list in the Spark application and restart the streaming app.

Thanks
Sandesh

On Wed, Jun 22, 2016 at 4:25 PM, Denys Cherepanin <denusk...@gmail.com> wrote:

> Hi Sandesh,
>
> As I understand it, you are using the "receiver based" approach to
> integrate Kafka with Spark Streaming.
>
> Have you tried the "direct" approach
> <http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers>?
> In that case offsets are tracked by the streaming app via checkpointing,
> and you should achieve exactly-once semantics.
>
> On Wed, Jun 22, 2016 at 5:58 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Spark Streaming does not guarantee exactly-once for output actions; it
>> only guarantees that each item is processed once within an RDD. You can
>> achieve at-most-once or at-least-once.
>> You could, however, do at-least-once (via checkpointing), record which
>> messages have already been processed (is some identifier available?), and
>> skip reprocessing them. You could also safely store which ranges have
>> already been processed, etc.
>>
>> Think about the business case: is exactly-once really needed, or can it
>> be replaced by one of the other semantics? Exactly-once, if needed,
>> requires more effort in any system, including Spark, and usually lowers
>> throughput. A risk evaluation from a business point of view has to be
>> done anyway...
>>
>> > On 22 Jun 2016, at 09:09, sandesh deshmane <sandesh.v...@gmail.com>
>> wrote:
>> >
>> > Hi,
>> >
>> > I am writing a Spark Streaming application which reads messages from
>> Kafka.
>> >
>> > I am using checkpointing and write-ahead logs (WAL) to achieve fault
>> tolerance.
>> >
>> > I use a batch interval of 10 sec for reading messages from Kafka.
>> >
>> > I read the messages from Kafka and generate counts based on the values
>> received in the Kafka messages.
>> >
>> > When there is a failure and my Spark Streaming application is
>> restarted, I see duplicate messages processed (close to 2 batches' worth).
>> >
>> > The problem is that I receive around 300k messages per second, so when
>> the application is restarted I see around 3-5 million duplicate counts.
>> >
>> > How can I avoid such duplicates?
>> >
>> > What is the best way to recover from such failures?
>> >
>> > Thanks
>> > Sandesh

> --
> Yours faithfully, Denys Cherepanin
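For reference, the direct approach from the linked docs looks roughly like this (a minimal sketch against the spark-streaming-kafka 0.8 API; the broker list, topic, and checkpoint path are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("KafkaCounts"), Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/kafka-counts")

    // Direct stream: no receivers and no WAL. Spark computes the offset
    // range for each batch itself and stores it in the checkpoint, so a
    // restart resumes from exact offsets instead of replaying ~2 batches.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val messages = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("events"))

    messages.map(_._2).countByValue().print()
    ssc.start()
    ssc.awaitTermination()

Even then, exactly-once for the output action still needs idempotent or transactional writes, per Jörn's point above.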