Hi Sandesh,

Where do these messages end up? Are they written to a sink (file, database, etc.)?

What is the reason your app fails? Can that be remedied to reduce the
impact?

How do you identify that duplicates have been sent and processed?

Cheers,

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 22 June 2016 at 10:38, sandesh deshmane <sandesh.v...@gmail.com> wrote:

> Mich Talebzadeh, thanks for the reply.
>
> We have a retention policy of 4 hours for Kafka messages, and we have
> multiple other consumers which read from the Kafka cluster (Spark is one of
> them).
>
> We have a timestamp in each message, but we actually have multiple messages
> with the same timestamp, so it is very hard to differentiate them.
>
> But we do have offsets in Kafka, and the consumer keeps track of its offset,
> so the consumer should read from that offset.
>
> So is it a problem with Kafka, not with Spark?
>
> Is there any way to rectify this?
> We don't have an id in the messages. If we add an id, then we could keep a
> map where we put the ids and mark them as processed, but at run time I would
> need to do that lookup, and for us the number of messages is very high, so
> won't the lookup add to the processing time?
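>
> (A minimal sketch of that lookup, assuming an id field is added in front of
> each message and a Redis instance is available to hold the seen-id set; the
> host, key prefix, and message layout are placeholders, not your actual setup.
> A per-partition connection and a TTL matched to the 4-hour retention keep the
> lookup cost bounded.)
>
> import org.apache.spark.streaming.dstream.DStream
> import redis.clients.jedis.Jedis
>
> def dedup(messages: DStream[String]): DStream[String] =
>   messages.transform { rdd =>
>     rdd.mapPartitions { iter =>
>       val jedis = new Jedis("redis-host", 6379)              // placeholder host
>       val fresh = iter.filter { msg =>
>         val id = msg.split(",", 2)(0)                        // assumes an id field prefixed to each message
>         val firstSeen = jedis.setnx("seen:" + id, "1") == 1L
>         if (firstSeen) jedis.expire("seen:" + id, 4 * 3600)  // expire with the 4-hour Kafka retention
>         firstSeen
>       }.toList                                               // materialise before closing the connection
>       jedis.close()
>       fresh.iterator
>     }
>   }
>
> (Note the trade-off: if a batch fails after the ids are marked but before the
> counts are written out, the retry will drop those messages, so this leans
> towards at-most-once for the failed batch.)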
>
> Thanks
> Sandesh Deshmane
>
> On Wed, Jun 22, 2016 at 2:36 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Yes, this is more of a Kafka issue, as Kafka sends the messages again.
>>
>> In your topic, do the messages come with an ID or timestamp that lets you
>> reject them if they have already been processed?
>>
>> In other words, do you have a way of knowing which message was last
>> processed by Spark before it failed?
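>>
>> (A minimal sketch of tracking the last processed message with the direct
>> Kafka stream from spark-streaming-kafka, Spark 1.6-era API: each batch
>> exposes the Kafka offset ranges it covers, and those can be stored together
>> with the batch output. The broker list, topic name and the println stand-in
>> for the real sink are placeholders.)
>>
>> import kafka.serializer.StringDecoder
>> import org.apache.spark.streaming.StreamingContext
>> import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
>>
>> def countWithOffsets(ssc: StreamingContext, brokers: String, topic: String): Unit = {
>>   val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
>>     ssc, Map("metadata.broker.list" -> brokers), Set(topic))
>>
>>   stream.foreachRDD { rdd =>
>>     // which Kafka offsets does this batch cover?
>>     val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>>     val count = rdd.count()
>>     // store the count and the offset ranges together (same transaction / same file);
>>     // on restart, anything whose offsets are already recorded can be skipped
>>     ranges.foreach(r =>
>>       println(s"${r.topic} part=${r.partition} ${r.fromOffset}..${r.untilOffset} count=$count"))
>>   }
>> }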
>>
>> You can of course reset the Kafka retention time and purge the topic:
>>
>>
>> ${KAFKA_HOME}/bin/kafka-topics.sh --zookeeper rhes564:2181 --alter
>> --topic newtopic --config retention.ms=1000
>>
>>
>>
>> Wait for a minute and then reset it back:
>>
>>
>>
>> ${KAFKA_HOME}/bin/kafka-topics.sh --zookeeper rhes564:2181 --alter
>> --topic newtopic --config retention.ms=600000
>>
>>
>>
>> ${KAFKA_HOME}/bin/kafka-console-consumer.sh --zookeeper rhes564:2181
>> --from-beginning --topic newtopic
>>
>>
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 22 June 2016 at 09:57, sandesh deshmane <sandesh.v...@gmail.com>
>> wrote:
>>
>>> Here I am referring to a failure in the Spark app.
>>>
>>> So when I restart, I see duplicate messages.
>>>
>>> To replicate the scenario, I just kill my Spark app and then restart it.
>>>
>>> On Wed, Jun 22, 2016 at 1:10 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> As I see it, you are using Spark Streaming to read data from the source
>>>> through Kafka. Your batch interval is 10 sec, so in that interval you have
>>>> 10 * 300K = 3 million messages.
>>>>
>>>> When you say there is a failure, are you referring to a failure in the
>>>> source or in the Spark Streaming app?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 22 June 2016 at 08:09, sandesh deshmane <sandesh.v...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am writing a Spark Streaming application which reads messages from
>>>>> Kafka.
>>>>>
>>>>> I am using checkpointing and write-ahead logs (WAL) to achieve fault
>>>>> tolerance.
>>>>>
>>>>> I have set a batch interval of 10 sec for reading messages from Kafka.
>>>>>
>>>>> I read messages from Kafka and generate a count of messages based on the
>>>>> values received in each Kafka message.
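>>>>>
>>>>> (A minimal sketch of that setup, assuming a receiver-based Kafka stream so
>>>>> that the WAL setting applies; the checkpoint path and app name are
>>>>> placeholders, and the Kafka stream and counting logic are elided.)
>>>>>
>>>>> import org.apache.spark.SparkConf
>>>>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>>>
>>>>> val checkpointDir = "hdfs:///tmp/checkpoints"   // placeholder path
>>>>>
>>>>> def createContext(): StreamingContext = {
>>>>>   val conf = new SparkConf()
>>>>>     .setAppName("kafka-counts")                 // placeholder app name
>>>>>     .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // WAL
>>>>>   val ssc = new StreamingContext(conf, Seconds(10))                // 10-second batches
>>>>>   ssc.checkpoint(checkpointDir)
>>>>>   // ... create the Kafka stream and the counting logic here ...
>>>>>   ssc
>>>>> }
>>>>>
>>>>> // on restart this recovers from the checkpoint instead of starting fresh;
>>>>> // the last in-flight batches can still be replayed (at-least-once), which
>>>>> // is what produces the duplicates described below
>>>>> val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
>>>>> ssc.start()
>>>>> ssc.awaitTermination()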
>>>>>
>>>>> In case there is a failure and my Spark Streaming application is
>>>>> restarted, I see duplicate messages processed (close to 2 batches' worth).
>>>>>
>>>>> The problem I have is that I get around 300k messages per second, and in
>>>>> case the application is restarted I see around 3-5 million duplicate counts.
>>>>>
>>>>> How can I avoid such duplicates?
>>>>>
>>>>> What is the best way to recover from such failures?
>>>>>
>>>>> Thanks
>>>>> Sandesh
>>>>>
>>>>
>>>>
>>>
>>
>
