We’re deploying with YARN in cluster mode to take advantage of the automatic
restart of the long-running streaming app. We’ve also done a POC on top of
Mesos+Marathon, so that’s always an option.
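If it helps, here’s a rough sketch of the submit side via SparkLauncher – the jar
path, main class and attempt count are placeholders, and the equivalent flags can
just as well go on spark-submit; spark.yarn.maxAppAttempts is the setting that
bounds how many times YARN relaunches the application master (and with it the
driver, in cluster mode):

import org.apache.spark.launcher.SparkLauncher

// Programmatic equivalent of spark-submit; the jar, class and attempt count
// below are placeholders. spark.yarn.maxAppAttempts bounds how many times
// YARN will relaunch the application master before giving up.
object SubmitStreamingApp {
  def main(args: Array[String]): Unit = {
    val submit = new SparkLauncher()
      .setAppResource("/path/to/streaming-app-assembly.jar")  // placeholder
      .setMainClass("com.example.StreamingApp")               // placeholder
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setConf("spark.yarn.maxAppAttempts", "4")              // placeholder value
      .launch()                                               // returns a java.lang.Process
    submit.waitFor()  // block until the spark-submit child process exits
  }
}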
For monitoring / alerting, we’re using a combination of:
* Spark REST API queried from OpsView via Nagios-style checks (sketched below)
  * Here we have thresholds on things like the number of successful jobs/tasks,
total execution time, etc.
* Custom business/operational metrics logged manually from the streaming app to
OpenTSDB
  * We’re using a combination of Spark accumulators and custom RDDs – after
summarizing some counters we push them to OpenTSDB via its REST API (sketched below)
  * We’re using dashboards built with Grafana that poll OpenTSDB – nicer
looking, same functionality
* A custom OpsView check that queries OpenTSDB and looks for a minimum number of
events processed by the job over a period of time (sketched below)
  * This is coupled with a stable stream of data from a canary instance
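To make the first bullet concrete, the OpsView side is basically a Nagios-style
script hitting Spark’s monitoring REST API. A minimal sketch in Scala – the host,
port and application id are placeholders (in yarn-cluster mode you’d usually go
through the YARN proxy or the history server):

import scala.io.Source

// Nagios-style check against Spark's monitoring REST API (/api/v1).
// Host, port and application id are placeholders.
object CheckSparkJobs {
  def main(args: Array[String]): Unit = {
    val appId = "application_1443400000000_0001"  // placeholder
    val url   = s"http://spark-ui-host:4040/api/v1/applications/$appId/jobs"
    val json  = Source.fromURL(url).mkString

    // Crude threshold check: count FAILED jobs without pulling in a JSON library.
    val failed = "\"status\"\\s*:\\s*\"FAILED\"".r.findAllIn(json).size
    if (failed > 0) { println(s"CRITICAL - $failed failed job(s)"); sys.exit(2) }
    else            { println("OK - no failed jobs");               sys.exit(0) }
  }
}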
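And a rough sketch of the accumulator plus OpenTSDB push from the second bullet.
The metric name, tag, endpoint and input source are made up, and it uses the old
accumulator API (newer Spark versions have sc.longAccumulator); OpenTSDB accepts
data points as JSON POSTed to /api/put:

import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWithOpenTsdbMetrics {
  def main(args: Array[String]): Unit = {
    val ssc    = new StreamingContext(new SparkConf().setAppName("my-streaming-app"), Seconds(10))
    val stream = ssc.socketTextStream("localhost", 9999)  // placeholder source

    // Counter accumulated across batches.
    val eventsProcessed = ssc.sparkContext.accumulator(0L, "eventsProcessed")

    stream.foreachRDD { rdd =>
      eventsProcessed += rdd.count()  // count() runs on the executors, += on the driver

      // Push the running total to OpenTSDB once per batch (driver side).
      val body =
        s"""{"metric":"streaming.events.processed",
           | "timestamp":${System.currentTimeMillis / 1000},
           | "value":${eventsProcessed.value},
           | "tags":{"app":"my-streaming-app"}}""".stripMargin
      val conn = new URL("http://opentsdb-host:4242/api/put")
        .openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      conn.setDoOutput(true)
      val out = new OutputStreamWriter(conn.getOutputStream)
      out.write(body)
      out.close()
      conn.getResponseCode  // fires the request; check/log non-2xx codes in real code
    }

    ssc.start()
    ssc.awaitTermination()
  }
}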
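The OpsView check against OpenTSDB (third bullet) is the same idea turned around:
ask whether data points actually arrived in the last few minutes. Another crude
sketch, reusing the made-up metric and tag from above:

import scala.io.Source

// Nagios-style check: did the job push any data points in the last 10 minutes?
// Uses OpenTSDB's /api/query endpoint; host, metric and tag are placeholders.
object CheckEventsProcessed {
  def main(args: Array[String]): Unit = {
    val url = "http://opentsdb-host:4242/api/query" +
      "?start=10m-ago&m=sum:streaming.events.processed{app=my-streaming-app}"
    val json = Source.fromURL(url).mkString

    // Crude presence check on the returned series; parse properly in real code.
    val hasDataPoints = json.contains("\"dps\"") && !json.contains("\"dps\":{}")
    if (hasDataPoints) { println("OK - recent data points found");            sys.exit(0) }
    else               { println("CRITICAL - no data points in last 10 min"); sys.exit(2) }
  }
}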
Hope this helps – feel free to Google around for all the above buzzwords :). I
can go into more detail on request.
-adrian
From: Chen Song
Date: Monday, September 28, 2015 at 5:00 PM
To: Krzysztof Zarzycki
Cc: user
Subject: Re: Notification on Spark Streaming job failure
I am also interested, specifically in monitoring and alerting on Spark Streaming
jobs. It would be helpful to get some general guidelines or advice on this from
people who have implemented anything in this area.
On Fri, Sep 18, 2015 at 2:35 AM, Krzysztof Zarzycki <[email protected]> wrote:
Hi there Spark Community,
I would like to ask you for advice: I'm running Spark Streaming jobs in
production. Sometimes these jobs fail and I would like to get an email
notification about it. Do you know how I can set up Spark to notify me by email
if my job fails? Or do I have to use an external monitoring tool?
I'm thinking of the following options:
1. As I'm running these jobs on YARN, somehow monitor the YARN applications. I
looked for this as well but couldn't find any YARN feature that does it.
2. Run the Spark Streaming job under a scheduler like Oozie, Azkaban, or Luigi.
Those are designed for batch jobs rather than streaming, but it could work. Has
anyone tried that?
3. Run the job driver under the "monit" tool, catch the failure, and send an
email about it. Currently I'm deploying in yarn-cluster mode and I would have to
give that up to run under monit....
4. Set up a monitoring tool (like Graphite, Ganglia, or Prometheus) and use Spark
metrics, then implement alerting there. Can I get information about failed jobs
from Spark metrics?
5. As in 4., but implement my own custom job metrics and monitor them (rough
sketch of what I mean below).
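For option 5, something like the following is what I have in mind – a
StreamingListener that raises an alert on receiver errors or slow batches.
"sendAlert" is just a placeholder for whatever notification channel I'd end up
using (email, etc.), and the threshold is arbitrary:

import org.apache.spark.streaming.scheduler._

// Alert on receiver errors and on batches that take too long to process.
class AlertingListener(sendAlert: String => Unit) extends StreamingListener {

  override def onReceiverError(event: StreamingListenerReceiverError): Unit =
    sendAlert(s"Receiver error: ${event.receiverInfo.lastErrorMessage}")

  override def onBatchCompleted(event: StreamingListenerBatchCompleted): Unit = {
    val processingMs = event.batchInfo.processingDelay.getOrElse(0L)
    if (processingMs > 60000L)  // arbitrary threshold; tune to your batch interval
      sendAlert(s"Batch ${event.batchInfo.batchTime} took $processingMs ms to process")
  }
}

// Register on the StreamingContext before calling ssc.start():
// ssc.addStreamingListener(new AlertingListener(msg => println(msg)))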
What's your opinion of my options? How do you solve this problem? Anything
Spark-specific?
I'll be grateful for any advice on this subject.
Thanks!
Krzysiek
--
Chen Song