OK. But one thing I observed is that when there is a problem with the
Kafka stream, the Streaming job does not restart unless I delete the
checkpoint directory. I guess it tries to retry the failed tasks, and if
it's not able to recover, it fails again. Sometimes it fails with a
StackOverflowError.

Why does the Streaming job not restart from the checkpoint directory when
the job failed earlier because the Kafka brokers got messed up? We have the
checkpoint directory in HDFS.
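
For context, a streaming job is normally restarted from a checkpoint via
StreamingContext.getOrCreate, with all of the stream setup done inside the
creating function so the context can be rebuilt on recovery. Below is a
minimal sketch of that pattern only, not the actual job discussed in this
thread; the checkpoint path, broker list, topic name, and per-batch logic are
all placeholders.

    // Minimal sketch: recovering a Kafka direct stream job from an HDFS
    // checkpoint directory. Paths, brokers, and topics are placeholders.
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object CheckpointedJob {
      val checkpointDir = "hdfs:///user/streaming/checkpoint"  // placeholder

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("kafka-direct-job")
        val ssc = new StreamingContext(conf, Seconds(30))
        ssc.checkpoint(checkpointDir)

        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("events"))                     // placeholder topic

        stream.map(_._2).count().print()                       // stand-in for the real logic
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Restores from the checkpoint if one exists; otherwise builds a
        // fresh context with createContext().
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }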

On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org> wrote:

> I don't think deleting the checkpoint directory is a good way to restart
> the streaming job; you should stop the spark context, or at the very least
> kill the driver process, then restart.
>
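
As a rough sketch of what stopping the context (rather than deleting the
checkpoint) can look like, the snippet below uses the standard
StreamingContext.stop call with a graceful shutdown; the shutdown-hook
variant is one common pattern, not necessarily what is being recommended
here.

    // Sketch: stop the streaming job cleanly instead of deleting the checkpoint.
    // stopGracefully = true lets in-flight batches finish before shutting down.
    import org.apache.spark.streaming.StreamingContext

    def shutdown(ssc: StreamingContext): Unit = {
      // Stops both the StreamingContext and the underlying SparkContext.
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }

    // Registered as a JVM shutdown hook, a plain `kill <driver-pid>` (SIGTERM)
    // will still trigger a clean stop:
    // sys.addShutdownHook { ssc.stop(stopSparkContext = true, stopGracefully = true) }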
> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
>>
>> Our job is our failsafe, as we don't have control over the Kafka stream
>> as of now. Can setting rebalance max retries help? We do not have any
>> monitors set up as of now; we need to set up the monitors.
>>
>> My idea is to have some kind of cron job that queries the Streaming API
>> for monitoring, say every 5 minutes, and then sends an email alert and
>> automatically restarts the Streaming job by deleting the checkpoint
>> directory. Would that help?
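
An in-process alternative to an external cron check is to register a
StreamingListener and alert when batches start falling behind. This is only
a sketch of that idea, not something suggested in the thread; the delay
threshold and the alert function are placeholders for whatever alerting is
actually available.

    // Sketch: alert when batch scheduling delay grows, via a StreamingListener.
    // The 5-minute threshold and the alert() hook are placeholders.
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class LagAlertListener(alert: String => Unit) extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val delayMs = batch.batchInfo.schedulingDelay.getOrElse(0L)
        if (delayMs > 5 * 60 * 1000) {
          alert(s"Batch ${batch.batchInfo.batchTime} delayed by $delayMs ms")
        }
      }
    }

    // Registered after the StreamingContext is created:
    // ssc.addStreamingListener(new LagAlertListener(msg => println(s"ALERT: $msg")))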
>>
>>
>>
>> Thanks!
>>
>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> The direct stream will fail the task if there is a problem with the
>>> Kafka broker.  Spark will retry failed tasks automatically, which should
>>> handle broker rebalances that happen in a timely fashion.
>>> spark.task.maxFailures controls the maximum number of retries before
>>> failing the job.  The direct stream isn't any different from any other
>>> Spark task in that regard.
>>>
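
For reference, spark.task.maxFailures is an ordinary Spark configuration
property (its default is 4). A hedged example of setting it, with an
illustrative value only:

    // Sketch: raise the per-task retry limit so transient broker problems are
    // retried more times before the stage (and job) is failed. Default is 4.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-direct-job")
      .set("spark.task.maxFailures", "8")  // illustrative value, not a recommendation

    // Equivalently on the command line:
    //   spark-submit --conf spark.task.maxFailures=8 ...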
>>> The question of what kind of monitoring you need is more a question for
>>> your particular infrastructure and what you're already using for
>>> monitoring.  We put all metrics (application level or system level) into
>>> graphite and alert from there.
>>>
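
Since graphite comes up here: Spark ships a GraphiteSink that is configured
through conf/metrics.properties, roughly as below. The host and port are
placeholders, and this only shows the stock sink, not necessarily how the
setup described above works.

    # conf/metrics.properties -- report metrics from all instances to Graphite.
    # Host and port are placeholders.
    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds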
>>> I will say that if you've regularly got problems with Kafka falling over
>>> for half an hour, I'd look at fixing that before worrying about Spark
>>> monitoring...
>>>
>>>
>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> How can Kafka Direct be recovered automatically when there is a problem
>>>> with the Kafka brokers? Sometimes our Kafka brokers get messed up and
>>>> the entire Streaming job blows up, unlike some other consumers, which do
>>>> recover automatically. How can I make sure that Kafka Direct recovers
>>>> automatically when the broker fails for some time, say 30 minutes? What
>>>> kind of monitors should be in place to recover the job?
>>>>
>>>> Thanks,
>>>> Swetha
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Direct-does-not-recover-automatically-when-the-Kafka-Stream-gets-messed-up-tp25331.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>>
>>>
>>
>
