That gets me close, but when a process dies monit goes into something of a tailspin. Here is what I see in the logs when I break apache:

Dec 7 09:04:16 tecate monit[27339]: 'apache' process is not running
Dec 7 09:04:16 tecate monit[27339]: 'apache' exec: /usr/bin/monit
Dec 7 09:04:16 tecate monit[27339]: 'ospfd' stop on user request
Dec 7 09:04:16 tecate monit[27339]: monit daemon at 27339 awakened
Dec 7 09:04:16 tecate monit[27339]: Awakened by User defined signal 1
Dec 7 09:04:16 tecate monit[27339]: 'ospfd' stop: /etc/init.d/ospfd
Dec 7 09:04:16 tecate monit[27339]: 'ospfd' stop action done
Dec 7 09:04:16 tecate monit[27339]: 'apache' process is not running
Dec 7 09:04:16 tecate monit[27339]: 'apache' exec: /usr/bin/monit
Dec 7 09:04:16 tecate monit[27339]: 'ospfd' stop on user request
Dec 7 09:04:16 tecate monit[27339]: monit daemon at 27339 awakened
Dec 7 09:04:16 tecate monit[27339]: Awakened by User defined signal 1
Dec 7 09:04:16 tecate monit[27339]: 'ospfd' stop action done
Dec 7 09:04:16 tecate monit[612]: monit: action failed -- Other action already in progress -- please try again later
Dec 7 09:04:16 tecate monit[27339]: 'apache' process is not running
Dec 7 09:04:16 tecate monit[27339]: 'apache' exec: /usr/bin/monit
Dec 7 09:04:16 tecate monit[27339]: 'ospfd' stop on user request
Dec 7 09:04:16 tecate monit[27339]: monit daemon at 27339 awakened
Dec 7 09:04:16 tecate monit[27339]: Awakened by User defined signal 1
Dec 7 09:04:16 tecate monit[27339]: Awakened by User defined signal 1
Dec 7 09:04:16 tecate monit[27339]: 'ospfd' stop action done
Dec 7 09:04:16 tecate monit[27339]: 'apache' process is not running
Dec 7 09:04:16 tecate monit[27339]: 'apache' exec: /usr/bin/monit
Dec 7 09:04:16 tecate monit[27339]: 'ospfd' stop on user request
Dec 7 09:04:16 tecate monit[27339]: monit daemon at 27339 awakened
...

And those messages are repeating as fast as I can tail the log. Any ideas how to make it behave a bit better?

Here is what I have in my apache and ospf configs now:

check process apache with pidfile /var/run/httpd.pid
  start program = "/etc/init.d/httpd start"
  stop program = "/etc/init.d/httpd stop"
  if does not exist then exec "/usr/bin/monit stop ospfd && /usr/bin/monit restart apache"
    else if recovered then exec "/usr/bin/monit monitor ospfd"
  if failed host localhost port 80 protocol http
    and request "/" then restart
  if children > 50 then restart
  if 2 restarts within 2 cycles then timeout
  group server
  depends on tomcat

check process ospfd with pidfile /var/run/quagga/ospfd.pid
  start program = "/etc/init.d/ospfd start"
  stop program = "/etc/init.d/ospfd stop"
  depends on apache
  depends on fcserver
  depends on mysql
  depends on tomcat
  group network

On 07.12.2011 01:15, Martin Pala wrote:

> The dependency in monit is currently "soft" … it defines the start/stop order and action cascading, but it doesn't wait for the parent service to recover before starting the dependent service. We'll address this in the future and will support "hard" dependencies.
>
> As a workaround you can do something like this:
>
> --8<--
> check process apache with pidfile /var/run/httpd.pid
>   start program = "/etc/init.d/httpd start"
>   stop program = "/etc/init.d/httpd stop"
>   if does not exist then exec "/bin/bash -c '/usr/bin/monit stop ospfd && /usr/bin/monit restart apache'" else if recovered "/usr/bin/monit start ospfd"
>
> check process ospfd with pidfile /var/run/quagga/ospfd.pid
>   start program = "/etc/init.d/ospfd start"
>   stop program = "/etc/init.d/ospfd stop"
> --8<--
>
> => if the apache crashes, the monit ospfd service is stopped and will be started only if it will recover.
>
> Regards,
> Martin
>
> On Dec 6, 2011, at 11:39 PM, drich wrote:
>
>> Changing the cycle time changes the frequency but not what happens. I still see essentially the following:
>>
>> * apache stops and monit detects it
>> * monit attempts a restart
>> * monit stops ospfd
>> * monit starts apache
>> * monit unmonitors ospfd
>> * ... 30 seconds later ...
>> * apache fails to start
>> * monit starts ospfd
>> * ... repeat the above cycle ...
>> * after 2 cycles, it triggers my "2 restarts" rule and stops ospfd
>>
>> I don't think it should be starting ospfd at all since the dependent service is failing to restart.
>>
>> On 06.12.2011 10:34, Rory Toma wrote:
>>
>>> What is your cycle time? Is it 30 sec? If it is, try increasing it to 1 minute.
>>>
>>> On 12/6/11 9:12 AM, drich wrote:
>>>
>>>> As I mentioned in an earlier e-mail, I'm trying to get monit to watch a group of processes so it can start/stop ospfd for an anycast high availability application. However, in doing this I'm seeing some odd behaviour that doesn't match what I expect -- is this a bug?
>>>>
>>>> In the scenario below, why is it ever trying to start ospfd? If apache is down, shouldn't ospfd stay down until apache comes back up or is monitored again after being unmonitored? It does end up in the correct state at the end, but not without restarting and stopping ospfd twice in the meantime.
>>>>
>>>> As an example, I have the following configured:
>>>>
>>>> check process apache with pidfile /var/run/httpd.pid
>>>>   start program = "/etc/init.d/httpd start"
>>>>   stop program = "/etc/init.d/httpd stop"
>>>>   if failed host localhost port 80 protocol http
>>>>     and request "/" then restart
>>>>   if 2 restarts within 2 cycles then stop
>>>>
>>>> check process ospfd with pidfile /var/run/quagga/ospfd.pid
>>>>   start program = "/etc/init.d/ospfd start"
>>>>   stop program = "/etc/init.d/ospfd stop"
>>>>   depends on apache
>>>>
>>>> If I make it so that apache cannot run (by removing execute permissions on /usr/sbin/httpd) and then kill it, I see the following in the monit logs:
>>>>
>>>> Dec 6 08:47:39 tecate monit[9988]: 'apache' process is not running
>>>> Dec 6 08:47:39 tecate monit[9988]: 'apache' trying to restart
>>>> Dec 6 08:47:39 tecate monit[9988]: 'ospfd' stop: /etc/init.d/ospfd
>>>> Dec 6 08:47:39 tecate monit[9988]: 'apache' start: /etc/init.d/httpd
>>>> Dec 6 08:47:40 tecate monit[9988]: 'ospfd' unmonitor on user request
>>>> Dec 6 08:47:40 tecate monit[9988]: monit daemon at 9988 awakened
>>>> Dec 6 08:48:09 tecate monit[9988]: 'apache' failed to start
>>>> Dec 6 08:48:09 tecate monit[9988]: 'ospfd' start: /etc/init.d/ospfd
>>>> Dec 6 08:48:09 tecate monit[9988]: 'ospfd' unmonitor action done
>>>> Dec 6 08:48:09 tecate monit[9988]: Awakened by User defined signal 1
>>>> Dec 6 08:48:09 tecate monit[9988]: 'apache' process is not running
>>>> Dec 6 08:48:09 tecate monit[9988]: 'apache' trying to restart
>>>> Dec 6 08:48:09 tecate monit[9988]: 'ospfd' stop: /etc/init.d/ospfd
>>>> Dec 6 08:48:09 tecate monit[9988]: 'apache' start: /etc/init.d/httpd
>>>> Dec 6 08:48:09 tecate monit[9988]: 'ospfd' unmonitor on user request
>>>> Dec 6 08:48:09 tecate monit[9988]: monit daemon at 9988 awakened
>>>> Dec 6 08:48:39 tecate monit[9988]: 'apache' failed to start
>>>> Dec 6 08:48:39 tecate monit[9988]: 'ospfd' start: /etc/init.d/ospfd
>>>> Dec 6 08:48:39 tecate monit[9988]: 'ospfd' unmonitor action done
>>>> Dec 6 08:48:39 tecate monit[9988]: Awakened by User defined signal 1
>>>> Dec 6 08:48:39 tecate monit[9988]: 'apache' service restarted 2 times within 2 cycles(s) - stop
>>>> Dec 6 08:48:39 tecate monit[9988]: 'ospfd' stop: /etc/init.d/ospfd
>>>> Dec 6 08:48:39 tecate monit[9988]: 'ospfd' unmonitor on user request
>>>> Dec 6 08:48:39 tecate monit[9988]: monit daemon at 9988 awakened
>>>> Dec 6 08:48:39 tecate monit[9988]: Awakened by User defined signal 1
>>>> Dec 6 08:48:39 tecate monit[9988]: 'ospfd' unmonitor action done
>>>>
>>>> --
>>>>
>>>> Dan Rich
>>>> http://www.employees.org/~drich/
>>>> "Step up to red alert!" "Are you sure, sir?
>>>> It means changing the bulb in the sign..."
>>>> - Red Dwarf (BBC)
>>>>
>>>> --
>>>> To unsubscribe:
>>>> https://lists.nongnu.org/mailman/listinfo/monit-general

--
Dan Rich
http://www.employees.org/~drich/
"Step up to red alert!" "Are you sure, sir?
It means changing the bulb in the sign..."
- Red Dwarf (BBC)
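
One difference between the config at the top of this message and Martin's quoted workaround is that the exec line at the top drops the /bin/bash -c wrapper. A minimal sketch of the apache check with the wrapper put back follows. It assumes monit starts exec programs directly rather than through a shell, so a bare && would reach /usr/bin/monit as an argument instead of chaining two commands. That would fit the log above, where ospfd is stopped but no apache restart is ever requested. The sketch is untested and the missing wrapper may not be the only cause of the loop. The remaining rules from the original check (port test, children limit, timeout, group, depends) are left out for brevity.

```
# Sketch only: the apache check from this message, with the shell wrapper
# from Martin's workaround restored on the exec line.
check process apache with pidfile /var/run/httpd.pid
  start program = "/etc/init.d/httpd start"
  stop program = "/etc/init.d/httpd stop"
  # Assumption: "&&" is shell syntax, so it only chains the two monit
  # commands when they are run through /bin/bash -c.
  if does not exist then exec "/bin/bash -c '/usr/bin/monit stop ospfd && /usr/bin/monit restart apache'"
    else if recovered then exec "/usr/bin/monit monitor ospfd"
```

Even with the wrapper in place, each /usr/bin/monit call still wakes the daemon (the "Awakened by User defined signal 1" lines), so the check may fire again before apache has had time to come back.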
