Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Jörn Franke
That is the cost of exactly once :)

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Cody Koeninger
The direct stream doesn't automagically give you exactly-once semantics. Indeed, you should be pretty suspicious of anything that claims to give you end-to-end exactly-once semantics without any additional work on your part. To the original poster, have you read / watched the materials linked
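[Editor's note: the "additional work" Cody refers to is typically tracking Kafka offsets yourself and committing them atomically with your output. A minimal sketch of that pattern, assuming the Spark 1.6-era Kafka 0.8 direct integration; broker addresses, the topic name, and saveToStore are placeholders, and a real app would replace saveToStore with a single transaction against its sink:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

    object DirectStreamOffsets {
      // Hypothetical sink: exactly-once requires writing results and offsets
      // together in ONE transaction, or making the write idempotent.
      def saveToStore(count: Long, offsets: Array[OffsetRange]): Unit =
        println(s"count=$count offsets=${offsets.mkString(",")}")

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("DirectStreamOffsets"), Seconds(10))
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("mytopic"))

        stream.foreachRDD { rdd =>
          // Offset ranges are only available on the RDD produced directly
          // by the stream, before any shuffle or repartition.
          val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          saveToStore(rdd.count(), offsets)
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }
]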

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
We have not tried the direct approach. We are using the receiver-based approach (we use ZooKeeper to connect from Spark). We have around 20+ Kafka brokers, and sometimes we replace the Kafka brokers (they go down). So each time I need to change the list in the Spark application and I need to restart the streaming

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Denys Cherepanin
Hi Sandesh, As I understand it, you are using the "receiver based" approach to integrate Kafka with Spark Streaming. Did you try the "direct" approach? In this case offsets will be tracked by
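[Editor's note: switching from the receiver-based to the direct approach is mostly a change in how the stream is created. A hedged sketch for comparison, assuming the Kafka 0.8 integration; the ZooKeeper quorum, brokers, group, and topic names are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectVsReceiver {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("DirectVsReceiver"), Seconds(10))

        // Receiver-based (what the thread describes): connects via ZooKeeper,
        // needs the WAL for fault tolerance.
        // val receiverStream =
        //   KafkaUtils.createStream(ssc, "zk1:2181", "mygroup", Map("mytopic" -> 1))

        // Direct approach: no receiver, no WAL; offsets are tracked in
        // checkpoints (or by your own code) rather than in ZooKeeper.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
        val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("mytopic"))

        directStream.map(_._2).count().print()
        ssc.start()
        ssc.awaitTermination()
      }
    }
]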

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
We are going with checkpointing. We don't have an identifier available to tell whether a message has already been processed. Even if we had one, it would slow down processing, as we get 300k messages per second, so the lookup would slow things down. Thanks Sandesh On Wed, Jun 22, 2016 at 3:28 PM,

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Jörn Franke
Spark Streaming does not guarantee exactly-once for output actions; it only means that each item is processed once within an RDD. You can achieve at-most-once or at-least-once. You could, however, do at-least-once (via checkpointing) and record which messages have been processed (is some identifier available?) and
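[Editor's note: recording which messages have been processed amounts to an idempotent write keyed by a message identifier, so replays overwrite rather than duplicate. A minimal sketch under the assumption that each message carries a unique id as its first comma-separated field; the in-memory map stands in for a real durable store:

    import scala.collection.concurrent.TrieMap

    object IdempotentSink {
      // Stand-in for a durable key-value store keyed by message id.
      private val store = TrieMap.empty[String, String]

      // A replay of the same id simply overwrites the previous value,
      // so at-least-once delivery yields no duplicate records downstream.
      def upsert(message: String): Unit = {
        val id = message.split(",", 2).head // assumed unique identifier
        store.put(id, message)
      }
    }
]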

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
Hi Sandesh, Where do these messages end up? Are they written to a sink (file, database, etc.)? What is the reason your app fails? Can that be remedied to reduce the impact? How do you identify that duplicates are sent and processed? Cheers, Dr Mich Talebzadeh LinkedIn *

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
Mich Talebzadeh, thanks for the reply. We have a retention policy of 4 hours for Kafka messages, and we have multiple other consumers that read from the Kafka cluster (Spark is one of them). We have a timestamp in each message, but we actually have multiple messages with the same timestamp. It's very hard to

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
Yes, this is more of a Kafka issue, as Kafka sends the messages again. In your topic, do messages come with an ID or timestamp by which you can reject them if they have already been processed? In other words, do you have a way of knowing what message was last processed via Spark before failing? You can of course
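[Editor's note: if messages did carry a monotonically increasing ID, the rejection Mich describes could be a simple high-water-mark check. A sketch under that assumption; a real app would persist lastProcessed somewhere durable rather than keep it in memory:

    object HighWaterMark {
      // Last successfully processed position; a var only for illustration,
      // in practice recovered from a durable store on restart.
      private var lastProcessed: Long = -1L

      // Accept only messages newer than the recovery point, so replays
      // delivered again after a restart are silently dropped.
      def shouldProcess(messageId: Long): Boolean = synchronized {
        if (messageId <= lastProcessed) false
        else { lastProcessed = messageId; true }
      }
    }
]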

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
Here I refer to a failure in the Spark app. So when I restart, I see duplicate messages. To replicate the scenario, I just kill my Spark app and then restart it. On Wed, Jun 22, 2016 at 1:10 PM, Mich Talebzadeh wrote:

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
As I see it, you are using Spark Streaming to read data from the source through Kafka. Your batch interval is 10 sec, so in that interval you have 10 * 300K = 3 million messages. When you say there is a failure, are you referring to a failure in the source or in the Spark Streaming app? HTH Dr Mich Talebzadeh

how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
Hi, I am writing a Spark Streaming application which reads messages from Kafka. I am using checkpointing and write-ahead logs (WAL) to achieve fault tolerance. I have set a batch interval of 10 sec for reading messages from Kafka. I read messages from Kafka and generate the count of messages as
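[Editor's note: for readers following the thread, a checkpoint-plus-WAL setup of the kind described usually looks like the sketch below. The checkpoint directory, ZooKeeper quorum, group, and topic are placeholders; StreamingContext.getOrCreate is what recovers from the checkpoint and replays the WAL on restart, which gives at-least-once (hence the duplicates discussed above), not exactly-once:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object CheckpointedCounter {
      val checkpointDir = "hdfs:///tmp/checkpoints" // placeholder path

      def createContext(): StreamingContext = {
        val conf = new SparkConf()
          .setAppName("CheckpointedCounter")
          // The WAL applies to receiver-based input streams like createStream
          .set("spark.streaming.receiver.writeAheadLog.enable", "true")
        val ssc = new StreamingContext(conf, Seconds(10)) // 10-sec batches, as in the post
        ssc.checkpoint(checkpointDir)

        val stream = KafkaUtils.createStream(ssc, "zk1:2181", "mygroup", Map("mytopic" -> 1))
        stream.count().print() // count of messages per batch
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart this recovers the context from the checkpoint instead of
        // rebuilding it, so processing resumes from the last recorded state.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }
]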