I store some metrics and the RDD that is the output of updateStateByKey in my checkpoint directory. I will retest and check the exact error that I get, but it's mostly a StackOverflowError. So, might increasing the stack size help?
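For reference, a minimal sketch of the checkpointing setup under discussion, with a placeholder socket source standing in for the Kafka direct stream and a hypothetical checkpoint path. The point it illustrates: updateStateByKey needs a checkpoint directory, and checkpointing the state DStream periodically truncates the ever-growing RDD lineage that commonly surfaces as exactly this kind of StackOverflowError.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StateCheckpointSketch {
  // Hypothetical checkpoint path; ours lives in HDFS as described in the thread.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("kafka-direct-state")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir) // required for updateStateByKey

    // Placeholder source; in the real job this is the Kafka direct stream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.map(word => (word, 1L))

    val state = counts.updateStateByKey[Long] { (newValues, prev) =>
      Some(newValues.sum + prev.getOrElse(0L))
    }
    // Checkpoint the state DStream periodically so the RDD lineage is truncated;
    // an unbounded lineage is a common cause of StackOverflowError here.
    state.checkpoint(Seconds(100))
    state.print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover the graph from the checkpoint directory if it exists,
    // otherwise build a fresh one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}

With getOrCreate, a restarted driver picks the job back up from the checkpoint directory rather than needing the directory deleted first.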
On Mon, Nov 9, 2015 at 12:45 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Without knowing more about what's being stored in your checkpoint directory / what the log output is, it's hard to say. But either way, just deleting the checkpoint directory probably isn't sufficient to restart the job...
>
> On Mon, Nov 9, 2015 at 2:40 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:
>
>> OK. But one thing that I observed is that when there is a problem with the Kafka stream, the Streaming job does not restart unless I delete the checkpoint directory. I guess it tries to retry the failed tasks and, if it's not able to recover, it fails again. Sometimes it fails with a StackOverflowError.
>>
>> Why does the Streaming job not restart from the checkpoint directory when the job failed earlier because the Kafka brokers got messed up? We have the checkpoint directory in our HDFS.
>>
>> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>>> I don't think deleting the checkpoint directory is a good way to restart the streaming job; you should stop the spark context or at the very least kill the driver process, then restart.
>>>
>>> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:
>>>
>>>> Hi Cody,
>>>>
>>>> Our job is our failsafe, as we don't have control over the Kafka stream as of now. Can setting rebalance max retries help? We do not have any monitors set up as of now; we need to set up the monitors.
>>>>
>>>> My idea is to have some kind of cron job that queries the Streaming API for monitoring, say every 5 minutes, then sends an email alert and automatically restarts the Streaming job by deleting the checkpoint directory. Would that help?
>>>>
>>>> Thanks!
>>>>
>>>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>
>>>>> The direct stream will fail the task if there is a problem with the kafka broker. Spark will retry failed tasks automatically, which should handle broker rebalances that happen in a timely fashion. spark.task.maxFailures controls the maximum number of retries before failing the job. Direct stream isn't any different from any other spark task in that regard.
>>>>>
>>>>> The question of what kind of monitoring you need is more a question for your particular infrastructure and what you're already using for monitoring. We put all metrics (application level or system level) into graphite and alert from there.
>>>>>
>>>>> I will say that if you've regularly got problems with kafka falling over for half an hour, I'd look at fixing that before worrying about spark monitoring...
>>>>>
>>>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How do I recover Kafka Direct automatically when there is a problem with the Kafka brokers? Sometimes our Kafka brokers get messed up and the entire Streaming job blows up, unlike some other consumers which do recover automatically. How can I make sure that Kafka Direct recovers automatically when a broker fails for some time, say 30 minutes? What kind of monitors should be in place to recover the job?
>>>>>>
>>>>>> Thanks,
>>>>>> Swetha
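For reference, a minimal sketch of the direct stream setup Cody describes, assuming the Spark 1.x / Kafka 0.8 integration and hypothetical broker and topic names. spark.task.maxFailures is the retry knob mentioned above; raising it gives transient broker trouble more chances to clear before the whole job is failed.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamRetrySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-direct-retries")
      // Allow more per-task retries so a short broker rebalance is
      // retried rather than failing the job outright.
      .set("spark.task.maxFailures", "8")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical brokers and topic.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("metrics")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Trivial action so the stream is materialized each batch.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note this only covers retries of a task within a batch; a broker outage longer than the retry budget will still fail the job, which is where external monitoring and a clean driver restart come in.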