[Apache Spark][Streaming Job][Checkpoint] Spark job failed on checkpoint recovery with "Batch not found" error

2020-05-28 Thread taylorwu
Hi, we have a Spark 2.4 job that fails on checkpoint recovery every few hours with the following errors (from the driver log): driver spark-kubernetes-driver ERROR 20:38:51 ERROR MicroBatchExecution: Query impressionUpdate [id = 54614900-4145-4d60-8156-9746ffc13d1f, runId =
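
For context, structured streaming recovery replays batch metadata recorded under the query's checkpointLocation. Below is a minimal sketch of where that location is pinned, assuming a Kafka source, a parquet sink, and placeholder broker, topic, and path names; none of these are taken from the failing job.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    SparkSession spark = SparkSession.builder().appName("impressions").getOrCreate();

    // Assumed Kafka source; brokers and topic are placeholders.
    Dataset<Row> events = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "impressions")
        .load();

    StreamingQuery query = events.writeStream()
        .queryName("impressionUpdate")
        .format("parquet")
        .option("path", "hdfs:///output/impressions") // assumed sink path
        // Recovery state (offsets, committed batches) lives here; reusing or
        // partially deleting this directory is a common source of recovery errors.
        .option("checkpointLocation", "hdfs:///checkpoints/impressionUpdate") // assumed path
        .start();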

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Jörn Franke
That is the cost of exactly once :) > On 22 Jun 2016, at 12:54, sandesh deshmane wrote: > We are going with checkpointing. We don't have an identifier available to tell whether a message has already been processed. Even if we had it, it would slow down the

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Cody Koeninger
The direct stream doesn't automagically give you exactly-once semantics. Indeed, you should be pretty suspicious of anything that claims to give you end-to-end exactly-once semantics without any additional work on your part. To the original poster, have you read / watched the materials linked

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
We have not tried the direct approach; we are using the receiver-based approach (we use ZooKeeper to connect from Spark). We have around 20+ Kafka brokers, and sometimes we replace them (they go down). So each time I need to change the broker list in the Spark application and restart the streaming

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Denys Cherepanin
Hi Sandesh, as I understand it you are using the "receiver based" approach to integrate Kafka with Spark streaming. Did you try the "direct" approach? In that case offsets will be tracked by
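
For reference, a minimal sketch of the "direct" integration Denys describes, using the spark-streaming-kafka 0.8 API that was current at the time and assuming an existing JavaStreamingContext jssc; the broker list and topic name are placeholders.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092"); // assumed brokers

    Set<String> topics = new HashSet<>(Arrays.asList("events")); // assumed topic

    // With the direct stream there are no receivers: consumed offsets are
    // tracked by Spark itself (in its checkpoints) rather than in ZooKeeper.
    JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topics);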

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
We are going with checkpointing. We don't have an identifier available to tell whether a message has already been processed. Even if we had one, it would slow down processing, since we get 300k messages per second and the lookup would slow things down. Thanks Sandesh On Wed, Jun 22, 2016 at 3:28 PM,

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Jörn Franke
Spark Streaming does not guarantee exactly-once for output actions; it only means that one item is processed exactly once within an RDD. You can achieve at-most-once or at-least-once. You could, however, do at least once (via checkpointing) and record which messages have been processed (is some identifier available?) and
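
One way to realize the at-least-once-plus-bookkeeping idea Jörn sketches, assuming the messages do carry some identifier and assuming a JavaPairDStream<String, String> messages: record each processed ID in an external store and skip replays. The Redis store, key scheme, and the extractId/writeToSink helpers below are all hypothetical.

    import redis.clients.jedis.Jedis;

    // At-least-once delivery made effectively exactly-once at the sink by
    // deduplicating on a per-message ID kept in an external store (here: Redis).
    messages.foreachRDD(rdd -> {
      rdd.foreachPartition(records -> {
        Jedis jedis = new Jedis("dedup-host"); // hypothetical Redis instance
        while (records.hasNext()) {
          String value = records.next()._2();
          String msgId = extractId(value); // hypothetical: pull the ID out of the message
          // SETNX succeeds only for the first writer, so replays are skipped.
          if (jedis.setnx("processed:" + msgId, "1") == 1L) {
            writeToSink(value); // hypothetical output action
          }
        }
        jedis.close();
      });
    });

In practice the dedup keys would carry a TTL matching the Kafka retention window (4 hours in this thread) so the store does not grow without bound; and, as noted elsewhere in the thread, at 300k messages/sec the extra lookup is itself a real cost.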

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
Hi Sandesh, where do these messages end up? Are they written to a sink (file, database, etc.)? What is the reason your app fails, and can that be remedied to reduce the impact? How do you identify that duplicates are sent and processed? Cheers, Dr Mich Talebzadeh

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
Mich Talebzadeh, thanks for the reply. We have a retention policy of 4 hours for Kafka messages, and we have multiple other consumers that read from the Kafka cluster (Spark is one of them). We have a timestamp in the message, but we actually have multiple messages with the same timestamp. It's very hard to

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
Yes, this is more of a Kafka issue, as Kafka sends the messages again. In your topic, do messages come with an ID or timestamp so you can reject them if they have already been processed? In other words, do you have a way to tell which message was last processed by Spark before failing? You can of course

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
Here I refer to a failure in the Spark app. When I restart, I see duplicate messages. To replicate the scenario, I just kill my Spark app and then restart it. On Wed, Jun 22, 2016 at 1:10 PM, Mich Talebzadeh wrote: > As I see it you are using Spark streaming to read

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
As I see it you are using Spark streaming to read data from the source through Kafka. Your batch interval is 10 sec, so in that interval you have 10 * 300K = 3 million messages. When you say there is a failure, are you referring to a failure in the source or in the Spark streaming app? HTH Dr Mich Talebzadeh

how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
Hi, I am writing a Spark streaming application that reads messages from Kafka. I am using checkpointing and write-ahead logs (WAL) to achieve fault tolerance. I have a batch interval of 10 sec for reading messages from Kafka. I read messages from Kafka and generate the count of messages as
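
A minimal sketch of the checkpoint-plus-WAL setup described here; the app name, checkpoint directory, and the elided stream-building code are assumptions.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    SparkConf conf = new SparkConf()
        .setAppName("kafka-counter") // assumed name
        .set("spark.streaming.receiver.writeAheadLog.enable", "true"); // WAL for received data

    String checkpointDir = "hdfs:///checkpoints/kafka-counter"; // assumed path

    // On a clean start the factory below runs; after a failure the context,
    // including pending batches, is recovered from the checkpoint instead.
    JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
      JavaStreamingContext ctx = new JavaStreamingContext(conf, Durations.seconds(10)); // 10s batches
      ctx.checkpoint(checkpointDir);
      // ... create the Kafka stream and the message-count logic here ...
      return ctx;
    });
    jssc.start();
    jssc.awaitTermination();

Recovered batches are re-run, which is exactly why the replies in this thread focus on duplicates: recovery gives at-least-once, not exactly-once, unless the output itself is idempotent.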

Re: Spark Streaming data checkpoint performance

2015-11-07 Thread trung kien
It took me 5 seconds to finish the same-size micro-batch; why is it so high? What kind of job runs in the checkpoint, and why does it keep increasing? 2/ When I change the data checkpoint interval

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Thúy Hằng Lê
...It seems that the default interval works more stable. On Nov 4, 2015 9:08 PM, "Adrian Tanase" <atan...@adobe.com> wrote: > Nice! Thanks for sharing, I wasn’t aware of the new API. Left some comments on the JIRA and design doc. -adrian From: Shixiong Zhu

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
...checkpoint interval? It seems that the default interval works more stable. On Nov 4, 2015 9:08 PM, "Adrian Tanase" <atan...@adobe.com> wrote: > Nice! Thanks for sharing, I wasn’t aware of the new API. Left some comments on the JIRA and design doc.

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
stats.checkpoint(Durations.seconds(100)); // changed to 100, the default is 10. The checkpoint time keeps increasing significantly: the first checkpoint is 10s, the second is 30s, the third is 70s ... and it keeps increasing :) Why

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Thúy Hằng Lê
...the second is 30s, the third is 70s ... and it keeps increasing :) Why is it so high when increasing the checkpoint interval? It seems that the default interval works more stable. On Nov 4, 2015 9:08 PM, "Adrian Tanase" <atan...@adobe.com> wrote:

Re: Spark Streaming data checkpoint performance

2015-11-05 Thread Thúy Hằng Lê
Cc: Adrian Tanase, "user@spark.apache.org" Subject: Re: Spark Streaming data checkpoint performance "trackStateByKey" is about to be added in 1.6 to resolve the performance issue of "updateStateByKey". You can take a look at https://issues.apache.org/jira/browse/SPARK-2629 and https://github.com/apache/spark/pull/9256

Re: Spark Streaming data checkpoint performance

2015-11-04 Thread Adrian Tanase
Nice! Thanks for sharing, I wasn’t aware of the new API. Left some comments on the JIRA and design doc. -adrian From: Shixiong Zhu Date: Tuesday, November 3, 2015 at 3:32 AM To: Thúy Hằng Lê Cc: Adrian Tanase, "user@spark.apache.org" Subject: Re:

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Adrian Tanase
From: Thúy Hằng Lê Date: Monday, November 2, 2015 at 7:20 AM To: "user@spark.apache.org" Subject: Spark Streaming data checkpoint performance JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Thúy Hằng Lê
...make your state object more complicated and try to prune out words with very few occurrences or that haven’t been updated for a long time - you can do this by emitting None from updateStateByKey. Hope this helps, -adrian From: Thúy Hằng Lê Date: Monday, November 2, 2015 at 7:20 AM
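
A sketch of the pruning idea Adrian describes, assuming the Spark 1.x Java API where updateStateByKey takes Guava Optional values and returning an absent value drops the key; wordCounts and the five-occurrence threshold are placeholders. (updateStateByKey also requires a checkpoint directory to be set.)

    import com.google.common.base.Optional;
    import java.util.List;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    // Running word count that prunes rarely seen words from the state.
    Function2<List<Long>, Optional<Long>, Optional<Long>> updateFunc =
        (newCounts, current) -> {
          long sum = current.or(0L);
          for (Long c : newCounts) {
            sum += c;
          }
          // Returning absent() removes the word from the state entirely
          // (the "emit None" trick from the quoted advice).
          if (newCounts.isEmpty() && sum < 5) { // assumed pruning policy
            return Optional.absent();
          }
          return Optional.of(sum);
        };

    JavaPairDStream<String, Long> stats = wordCounts.updateStateByKey(updateFunc);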

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Shixiong Zhu
"trackStateByKey" is about to be added in 1.6 to resolve the performance issue of "updateStateByKey". You can take a look at https://issues.apache.org/jira/browse/SPARK-2629 and https://github.com/apache/spark/pull/9256

Spark Streaming data checkpoint performance

2015-11-01 Thread Thúy Hằng Lê
Hi Spark gurus, I am evaluating Spark Streaming. In my application I need to maintain cumulative statistics (e.g. the total running word count), so I need to call the updateStateByKey function on every micro-batch. After setting those things up, I got the following behaviors: * The Processing Time

Re: spark streaming with checkpoint

2015-01-25 Thread Balakrishnan Narendran
Yeah, use streaming to gather the incoming logs and write them to a log file, then run a Spark job every 5 minutes to process the counts. Got it. Thanks a lot. On 07:07, Mon, 26 Jan 2015 Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Tue, Jan 20, 2015 at 8:16 PM, balu.naren balu.na...@gmail.com

Re: spark streaming with checkpoint

2015-01-25 Thread Tobias Pfeiffer
Hi, on Tue, Jan 20, 2015 at 8:16 PM, balu.naren balu.na...@gmail.com wrote: I am a beginner to Spark streaming, so I have a basic doubt regarding checkpoints. My use case is to calculate the number of unique users by day. I am using reduceByKey and window for this, where my window duration is 24 hours.
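
For the unique-users-by-day use case in this thread, a hedged sketch using the inverse-reduce form of reduceByKeyAndWindow, which maintains the window incrementally instead of recomputing 24 hours of data; events, the userIdOf helper, and all durations are placeholders. Note that this form requires checkpointing to be enabled.

    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import scala.Tuple2;

    // One (user, 1) pair per event; userIdOf is a hypothetical extractor.
    JavaPairDStream<String, Long> hits =
        events.mapToPair(e -> new Tuple2<>(userIdOf(e), 1L));

    JavaPairDStream<String, Long> perUser = hits.reduceByKeyAndWindow(
        (a, b) -> a + b,            // fold newly arrived batches into the window
        (a, b) -> a - b,            // subtract batches sliding out of the window
        Durations.minutes(24 * 60), // 24-hour window
        Durations.minutes(5));      // slide every 5 minutes
    perUser.checkpoint(Durations.minutes(25)); // required by the inverse-reduce form

    // Distinct users currently in the window. Without the filterFunc overload,
    // keys whose count has dropped to zero linger in the window state.
    JavaDStream<Long> uniqueUsers = perUser.count();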

Re: spark streaming with checkpoint

2015-01-22 Thread Balakrishnan Narendran
...hours. I don’t think checkpointing in Spark Streaming can alleviate such a problem, because checkpoints are mainly for fault tolerance. Thanks, Jerry. From: balu.naren [mailto:balu.na...@gmail.com] Sent: Tuesday, January 20, 2015 7:17 PM To: user@spark.apache.org Subject: spark

RE: spark streaming with checkpoint

2015-01-22 Thread Shao, Saisai
...memory should be enough to hold the data of 24 hours. I don’t think checkpointing in Spark Streaming can alleviate such a problem, because checkpoints are mainly for fault tolerance. Thanks, Jerry. From: balu.naren [mailto:balu.na...@gmail.com] Sent: Tuesday, January 20, 2015 7:17 PM

Re: spark streaming with checkpoint

2015-01-22 Thread Jörn Franke
...you have such a large window (24 hours), so the phenomenon of memory increasing is expected, because the window operation will cache the RDDs within this window in memory. So for your requirement, memory should be enough to hold the data of 24 hours. I don’t think checkpointing in Spark Streaming can

spark streaming with checkpoint

2015-01-20 Thread balu.naren
...reduction has taken place. Appreciate your help. Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-with-checkpoint-tp21263.html

RE: spark streaming with checkpoint

2015-01-20 Thread Shao, Saisai
...I don’t think checkpointing in Spark Streaming can alleviate such a problem, because checkpoints are mainly for fault tolerance. Thanks, Jerry. From: balu.naren [mailto:balu.na...@gmail.com] Sent: Tuesday, January 20, 2015 7:17 PM To: user@spark.apache.org Subject: spark streaming with checkpoint I am a beginner to spark streaming. So