On 7 Oct 2015, at 06:28, Krzysztof Zarzycki <k.zarzy...@gmail.com> wrote:

Hi Vikram,

So you gave up on the yarn-cluster mode of launching Spark jobs, is that right? AFAIK, in yarn-cluster mode the launch process (spark-submit) monitors the job running on YARN, but if the launch process is killed or dies, it just stops printing the state (usually RUNNING) without affecting the monitored job. So you cannot use Monit features (restart on failure, etc.) on the launch process. One more thing: Monit depends on pidfiles, and spark-submit (in yarn-client mode) does not create them. Do you create them on your own?

Thanks!
Krzysiek

You know, there's nothing to stop anyone adding a little monitoring tool: just poll the YARN RM for application reports and raise a failure when the application reaches the FAILED or KILLED state. If you do this, do test what happens during an AM restart. You probably want to send a notification there too, but it is not as serious as a full application failure.

2015-10-07 6:37 GMT+02:00 Vikram Kone <vikramk...@gmail.com>:

We are using Monit to kick off Spark Streaming jobs and it seems to work fine.

On Monday, September 28, 2015, Chen Song <chen.song...@gmail.com> wrote:

I am also interested, specifically in monitoring and alerting on Spark Streaming jobs. It would be helpful to get some general guidelines or advice on this from people who have implemented anything in this area.

On Fri, Sep 18, 2015 at 2:35 AM, Krzysztof Zarzycki <k.zarzy...@gmail.com> wrote:

Hi there Spark Community,

I would like to ask you for advice: I'm running Spark Streaming jobs in production. Sometimes these jobs fail and I would like to get an email notification about it. Do you know how I can set up Spark to notify me by email if my job fails? Or do I have to use an external monitoring tool? I'm thinking of the following options:

1. As I'm running these jobs on YARN, monitor the YARN jobs somehow. I looked for this but couldn't find any YARN feature to do it.
2. Run the Spark Streaming job in a scheduler like Oozie, Azkaban, or Luigi.
Those schedulers are built for batch jobs rather than streaming, but they could work. Has anyone tried that?
3. Run the job driver under the "monit" tool, catch the failure, and send an email about it. Currently I'm deploying in yarn-cluster mode, and I would need to give that up to run under monit.
4. Deploy a monitoring tool (like Graphite, Ganglia, or Prometheus) and use Spark metrics, then implement alerting in that tool. Can I get information about failed jobs from Spark metrics?
5. As in 4, but implement my own custom job metrics and monitor those.

What's your opinion on these options? How do you solve this problem? Anything Spark-specific? I'll be grateful for any advice on this subject.

Thanks!
Krzysiek

--
Chen Song
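For option 3 (and the pidfile question above), one way this is commonly wired up, sketched here with entirely hypothetical paths and names: since spark-submit does not write a pidfile, a small wrapper script (say `/opt/spark/start-streaming-job.sh`) launches spark-submit in yarn-client mode in the background and writes its PID with `echo $! > /var/run/spark-streaming-job.pid`; monit then watches that pidfile. A monit control-file fragment might look like:

```
# Hypothetical monit stanza; paths, names, and the alert address are placeholders.
check process spark-streaming-job with pidfile /var/run/spark-streaming-job.pid
  start program = "/opt/spark/start-streaming-job.sh"
  stop program  = "/opt/spark/stop-streaming-job.sh"
  alert admin@example.com
```

Note the caveat from the thread: this only watches the launch process, so it fits yarn-client mode; in yarn-cluster mode the driver runs inside YARN and restarting the local process tells you nothing about the job itself.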
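For options 4 and 5, Spark can push its metrics to Graphite through the built-in GraphiteSink, configured in `conf/metrics.properties`; alerting (e.g. on metrics going stale when a job dies) then lives in the monitoring system. A sketch, with the host and prefix as placeholders:

```
# conf/metrics.properties: send metrics from all instances to Graphite.
# graphite.example.com is a placeholder for your Graphite host.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark
```

For option 5, custom counters registered through the same metrics system would flow to Graphite alongside the built-in ones.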
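The "little monitoring tool" suggested above (poll the YARN RM for application reports and fail on FAILED/KILLED) can be sketched as below. This is a minimal sketch, not a production tool: `RM_URL`, the `send_email()` stub, and the polling interval are placeholders I've made up; the `/ws/v1/cluster/apps/{app-id}` endpoint is the standard ResourceManager REST API.

```python
# Sketch: poll the YARN ResourceManager REST API for one application's report
# and alert when it reaches a terminal bad state (FAILED or KILLED).
import json
import time
import urllib.request

RM_URL = "http://resourcemanager.example.com:8088"  # placeholder: your RM web address
BAD_STATES = {"FAILED", "KILLED"}


def fetch_app_report(rm_url, app_id):
    """Fetch one application report from the RM REST API (ws/v1/cluster/apps)."""
    with urllib.request.urlopen(f"{rm_url}/ws/v1/cluster/apps/{app_id}") as resp:
        return json.load(resp)["app"]


def should_alert(report):
    """Return an alert message if the app ended badly, else None."""
    state = report.get("state")
    final = report.get("finalStatus")
    if state in BAD_STATES or final in BAD_STATES:
        return (f"Application {report.get('id')} ended: "
                f"state={state}, finalStatus={final}")
    return None


def send_email(message):
    """Stub notifier; replace with smtplib, a webhook, or your pager of choice."""
    print(message)


def monitor(rm_url, app_id, interval_s=30):
    """Poll until the application reaches a terminal bad state, then notify once."""
    while True:
        msg = should_alert(fetch_app_report(rm_url, app_id))
        if msg:
            send_email(msg)
            return
        time.sleep(interval_s)
```

Per the advice above, also check what the report looks like across an AM restart (the application stays in the apps list and usually returns to RUNNING), so a restart triggers at most a soft notification rather than a failure alert.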