I store some metrics and the RDD that is the output of updateStateByKey in my checkpoint directory. I will retest and check the exact error that I get, but it's mostly a StackOverflowError. So, might increasing the stack size help?
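For reference, a minimal sketch of the checkpointing setup under discussion, with a placeholder socket source standing in for the Kafka direct stream and a hypothetical checkpoint path. The point it illustrates: updateStateByKey needs a checkpoint directory, and checkpointing the state DStream periodically truncates the ever-growing RDD lineage that commonly surfaces as exactly this kind of StackOverflowError.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StateCheckpointSketch {
  // Hypothetical checkpoint path; ours lives in HDFS as described in the thread.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("kafka-direct-state")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir) // required for updateStateByKey

    // Placeholder source; in the real job this is the Kafka direct stream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.map(word => (word, 1L))

    val state = counts.updateStateByKey[Long] { (newValues, prev) =>
      Some(newValues.sum + prev.getOrElse(0L))
    }
    // Checkpoint the state DStream periodically so the RDD lineage is truncated;
    // an unbounded lineage is a common cause of StackOverflowError here.
    state.checkpoint(Seconds(100))
    state.print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover the graph from the checkpoint directory if it exists,
    // otherwise build a fresh one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}

With getOrCreate, a restarted driver picks the job back up from the checkpoint directory rather than needing the directory deleted first.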
On Mon, Nov 9, 2015 at 12:45 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Without knowing more about what's being stored in your checkpoint directory / what the log output is, it's hard to say. But either way, just deleting the checkpoint directory probably isn't sufficient to restart the job...
>
> On Mon, Nov 9, 2015 at 2:40 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:
>
>> OK. But one thing that I observed is that when there is a problem with the Kafka stream, the Streaming job does not restart unless I delete the checkpoint directory. I guess it tries to retry the failed tasks and, if it's not able to recover, it fails again. Sometimes it fails with a StackOverflowError.
>>
>> Why does the Streaming job not restart from the checkpoint directory when the job failed earlier because the Kafka brokers got messed up? We have the checkpoint directory in our HDFS.
>>
>> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>>> I don't think deleting the checkpoint directory is a good way to restart the streaming job; you should stop the spark context or at the very least kill the driver process, then restart.
>>>
>>> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:
>>>
>>>> Hi Cody,
>>>>
>>>> Our job is our failsafe, as we don't have control over the Kafka stream as of now. Can setting rebalance max retries help? We do not have any monitors set up as of now; we need to set up the monitors.
>>>>
>>>> My idea is to have some kind of cron job that queries the Streaming API for monitoring, say every 5 minutes, then sends an email alert and automatically restarts the Streaming job by deleting the checkpoint directory. Would that help?
>>>>
>>>> Thanks!
>>>>
>>>> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>
>>>>> The direct stream will fail the task if there is a problem with the kafka broker. Spark will retry failed tasks automatically, which should handle broker rebalances that happen in a timely fashion. spark.task.maxFailures controls the maximum number of retries before failing the job. Direct stream isn't any different from any other spark task in that regard.
>>>>>
>>>>> The question of what kind of monitoring you need is more a question for your particular infrastructure and what you're already using for monitoring. We put all metrics (application level or system level) into graphite and alert from there.
>>>>>
>>>>> I will say that if you've regularly got problems with kafka falling over for half an hour, I'd look at fixing that before worrying about spark monitoring...
>>>>>
>>>>> On Mon, Nov 9, 2015 at 12:26 PM, swetha <swethakasire...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How do I recover Kafka Direct automatically when there is a problem with the Kafka brokers? Sometimes our Kafka brokers get messed up and the entire Streaming job blows up, unlike some other consumers which do recover automatically. How can I make sure that Kafka Direct recovers automatically when a broker fails for some time, say 30 minutes? What kind of monitors should be in place to recover the job?
>>>>>>
>>>>>> Thanks,
>>>>>> Swetha
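For reference, a minimal sketch of the direct stream setup Cody describes, assuming the Spark 1.x / Kafka 0.8 integration and hypothetical broker and topic names. spark.task.maxFailures is the retry knob mentioned above; raising it gives transient broker trouble more chances to clear before the whole job is failed.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamRetrySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-direct-retries")
      // Allow more per-task retries so a short broker rebalance is
      // retried rather than failing the job outright.
      .set("spark.task.maxFailures", "8")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical brokers and topic.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("metrics")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Trivial action so the stream is materialized each batch.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note this only covers retries of a task within a batch; a broker outage longer than the retry budget will still fail the job, which is where external monitoring and a clean driver restart come in.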