I've seen this before during an extreme outage on the cluster, where the Kafka
offsets checkpointed by the direct stream RDD were larger than what Kafka
reported, so the checkpoint was effectively corrupted.
I don't know the root cause, but since I was stressing the cluster during a
reliability test, I can only assume that one of the Kafka partitions was
restored from an out-of-sync replica and did not contain all the data. It seems
extreme, but I don't have another explanation.

@Cody – do you know of a way to recover from a situation like this? Can someone
manually delete folders from the checkpoint directory to help the job recover,
e.g. go two checkpoints back, hoping that Kafka still has those offsets?
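If it helps, one way to side-step a corrupted checkpoint entirely is to start a
fresh direct stream from explicitly chosen offsets rather than from the
checkpoint. A rough sketch against the Spark 1.x Kafka direct API; the topic
name, broker list, offsets, and the existing StreamingContext (ssc) are
placeholders:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Hypothetical per-partition offsets that are known to still exist on the
    // brokers (e.g. checked with Kafka's GetOffsetShell beforehand).
    val fromOffsets = Map(
      TopicAndPartition("my-topic", 0) -> 123400L,
      TopicAndPartition("my-topic", 1) -> 123350L)

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // Start the stream at the chosen offsets instead of the checkpointed ones.
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

This throws away the old checkpoint state, so it only makes sense if the job
can tolerate reprocessing (or skipping) a small window of data.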

-adrian

From: swetha kasireddy
Date: Monday, November 9, 2015 at 10:40 PM
To: Cody Koeninger
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: Kafka Direct does not recover automatically when the Kafka Stream 
gets messed up?

OK. But one thing I observed is that when there is a problem with the Kafka
stream, the streaming job does not restart unless I delete the checkpoint
directory. I guess it tries to retry the failed tasks, and if it's not able to
recover, it fails again. Sometimes it fails with a StackOverflowError.

Why does the streaming job not restart from the checkpoint directory when it
previously failed because the Kafka brokers got messed up? We keep the
checkpoint directory in our HDFS.
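For what it's worth, the usual restart pattern is StreamingContext.getOrCreate,
which rebuilds the context from the checkpoint when one exists and only calls
the setup function otherwise. A minimal sketch, with the checkpoint path and
batch interval as placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///user/streaming/checkpoint"  // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-direct-job")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... build the direct stream and the rest of the job here ...
      ssc
    }

    // Recover from the checkpoint if present, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()

If the checkpoint itself is corrupted, recovery will keep failing until the
directory is cleared or the job is started from explicit offsets.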

On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger 
<c...@koeninger.org> wrote:
I don't think deleting the checkpoint directory is a good way to restart the
streaming job. You should stop the Spark context, or at the very least kill the
driver process, then restart.
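A minimal sketch of that shutdown, assuming you hold a reference to the running
StreamingContext as ssc:

    // Stop the streaming context and the underlying SparkContext, letting
    // any in-flight batches finish before the process exits.
    ssc.stop(stopSparkContext = true, stopGracefully = true)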

On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy 
<swethakasire...@gmail.com> wrote:
Hi Cody,

Our job is our failsafe, as we don't have control over the Kafka stream as of
now. Can setting rebalance max retries help? We do not have any monitors set up
as of now; we still need to set them up.

My idea is to have some kind of cron job that queries the Streaming API for
monitoring, say every 5 minutes, and then sends an email alert and automatically
restarts the streaming job by deleting the checkpoint directory. Would that help?
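An in-process alternative to polling from cron is a StreamingListener that
watches batch completion and raises an alert when the batch delay grows too
large. A rough sketch; the threshold and the alert call are placeholders:

    import org.apache.spark.streaming.scheduler.{
      StreamingListener, StreamingListenerBatchCompleted}

    class BatchHealthListener extends StreamingListener {
      override def onBatchCompleted(
          batch: StreamingListenerBatchCompleted): Unit = {
        // totalDelay is in milliseconds when the batch has completed.
        val delayMs = batch.batchInfo.totalDelay.getOrElse(0L)
        if (delayMs > 5 * 60 * 1000L) {
          alert(s"Streaming batch delay is $delayMs ms")
        }
      }

      // Placeholder: wire this to email / paging instead of stdout.
      private def alert(msg: String): Unit = println(s"ALERT: $msg")
    }

    // Register before ssc.start():
    // ssc.addStreamingListener(new BatchHealthListener)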



Thanks!

On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger 
<c...@koeninger.org> wrote:
The direct stream will fail the task if there is a problem with the Kafka
broker.  Spark will retry failed tasks automatically, which should handle
broker rebalances that happen in a timely fashion. spark.task.maxFailures
controls the maximum number of retries before failing the job.  The direct
stream isn't any different from any other Spark task in that regard.
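A small sketch of raising spark.task.maxFailures (it defaults to 4) when
building the conf, so transient broker rebalances have more room to resolve
before the job fails:

    import org.apache.spark.SparkConf

    // Allow more task retries before the whole job is failed.
    val conf = new SparkConf()
      .setAppName("kafka-direct-job")
      .set("spark.task.maxFailures", "8")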

What kind of monitoring you need depends more on your particular
infrastructure and what you're already using for monitoring.  We put all
metrics (application level or system level) into Graphite and alert from
there.

I will say that if you regularly have problems with Kafka falling over for
half an hour, I'd look at fixing that before worrying about Spark monitoring...


On Mon, Nov 9, 2015 at 12:26 PM, swetha 
<swethakasire...@gmail.com> wrote:
Hi,

How can Kafka Direct recover automatically when there is a problem with the
Kafka brokers? Sometimes our Kafka brokers get messed up and the entire
streaming job blows up, unlike some other consumers which do recover
automatically. How can I make sure that Kafka Direct recovers automatically
when a broker fails for some time, say 30 minutes? What kind of monitors
should be in place to recover the job?

Thanks,
Swetha


