Re: Feature Request HostGroup in environment.
I think that's a great idea, which is probably why it already exists... :)

Try MON_GROUP and MON_SERVICE. (I see that the documentation doesn't list those for monitors, only for alerts, but they do exist and work.)

-David

On Wed, Jan 6, 2010 at 1:30 PM, Nathan Gibbs nat...@cmpublishers.com wrote:
> What?
> Export the HostGroup of the about-to-be-run monitor into its environment, possibly as something like MON_HOST_GROUP.
>
> Why?
>
> Summary
> To give a monitor a way to distinguish itself from another instance of itself running in a different HostGroup.
>
> Detail
> For years I've had a situation where a server reboot or an snmpd service restart would occasionally put the reboot.monitor into an error state for a random amount of time longer than necessary, anywhere from 5 minutes to hours. Sometimes the problem would fix itself; other times I would have to rm the state file.
>
> What was happening was that the reboot.monitor in HG1, where the reboot happened, would write the state file just after the reboot.monitor in HG2 had read it. The monitor in HG2 would then write out incorrect data for the hosts in HG1.
>
> Yes, in this particular instance I could use the --statefile= option and be done with it. However, I'm thinking beyond this particular monitor. If this feature were added:
>
> 1. Any monitor that needed a unique statefile name could trivially get one:
>    $STATEFILE = $ENV{MON_HOST_GROUP} . ".$ME.state";
>    This or something like it could be added to the monitor template.
> 2. All statefiles would follow a convention of HostGroup.Monitor.state.
> 3. It would be easy to know which file was built by which monitor instance.
> 4. There would be no need to implement an option to set a statefile name.
> 5. The config would be simpler, as the above options would no longer be needed.
>
> What do you think? :-)
>
> --
> Sincerely,
> Nathan Gibbs
> Systems Administrator
> Christ Media
> http://www.cmpublishers.com

___
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon
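The HostGroup.Monitor.state convention proposed in this thread can be sketched as a small shell helper for a monitor script. This is only an illustration: it assumes mon exports MON_GROUP to monitors as the reply says, while STATE_DIR, the "nogroup" fallback, and the function name statefile_for are hypothetical and not part of mon.

```shell
# Build a per-hostgroup statefile path of the form
#   <dir>/<HostGroup>.<Monitor>.state
# MON_GROUP is the variable named in the reply above.
# STATE_DIR and the "nogroup" fallback are hypothetical choices.
statefile_for() {
    sf_monitor=$1
    sf_dir=${STATE_DIR:-/tmp}
    sf_group=${MON_GROUP:-nogroup}
    printf '%s/%s.%s.state\n' "$sf_dir" "$sf_group" "$sf_monitor"
}
```

Inside a monitor running for hostgroup HG1 with STATE_DIR=/tmp, `statefile_for reboot.monitor` gives /tmp/HG1.reboot.monitor.state, so two instances in different hostgroups never share a file.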
Re: syntax error using exclude_period
On Mon, Dec 7, 2009 at 4:51 PM, Alex Dean a...@crackpot.org wrote:
> This is using the mon package provided by Ubuntu Karmic (9.10).
>
> # dpkg --list | grep mon
> ... skipping a bunch of mono stuff ...
> ii mon 0.99.2-13ubuntu1 monitor hosts/services/whatever and alert ab

I'm pretty sure that's your problem right there. I think this was a bug in that version of Mon. (And that version is at least 6 years old.) Please upgrade and try again.

-David
Re: Polling interval not updated when montraps are received
On Fri, Nov 20, 2009 at 12:28 PM, Anders Synstad ander...@basefarm.no wrote:
> On the server side however, the check works as a heartbeat, checking if the local service is still alive. But this is only performed once every hour.

My suggestion would be to use the 'redistribute' feature that was added a while back. Configured on the agent, it passes every status update to the master, so you can see that the check was run recently and that the result was OK. You can then also set the traptimeout setting on the master to ensure that you are receiving traps at regular intervals, and alert if the agent stops sending them.

I did exactly this in a master/slave Mon setup. (It's why I implemented the redistribute feature.)

-David
Re: dns resolver monitoring?
There already is support in dns.monitor for recursive server testing. (I wrote the code.) Try:

    dns.monitor -caching_only -query www.yahoo.com:A -query google.com:MX servername

-David

On Tue, Nov 3, 2009 at 4:11 PM, Nathan Gibbs nat...@cmpublishers.com wrote:
> * Kastus Shchuka wrote:
> On Tue, Nov 03, 2009 at 12:24:33PM -0500, Nathan Gibbs wrote:
>
> Isn't a resolver part of the OS libraries that do DNS lookups, not a network service that can be checked?
>
> Mike probably used resolver meaning recursive/caching server.
>
> Yeah, you're right there.
>
> There is no sense in monitoring resolver libraries.
>
> My point exactly. At least, that was what I was trying to say. :-)
>
> You may want to look at http://cr.yp.to/djbdns/separation.html for explanation.
>
> dns.monitor -caching_only record:TXT:result should be able to do it, but doesn't appear to work like the instructions say.
>
> There are too many aspects involved in recursive name resolution and there is no easy way (or sense) to monitor all of them.
>
> Right. dns.monitor is only proving that all authoritative DNS servers serve the same zone information. It does not check whether the published zone is correct, though.
>
> One possible way to monitor a recursive/caching server would be to resolve a name coming from a known good authoritative server. It's fairly easy to script and convert into a monitor.
>
> Yeah, a few mods to dns.monitor would make that work. I don't plan on doing it this year, maybe next.
>
> --
> Sincerely,
> Nathan Gibbs
> Systems Administrator
> Christ Media
> http://www.cmpublishers.com
Re: multi depend gives error
Udo,

Mon depend expressions are Perl expressions. You probably want:

    depend webservers1:ldap && gateway:ping

-David

On Tue, Aug 19, 2008 at 7:23 AM, Udo Rader [EMAIL PROTECTED] wrote:
> Hi,
>
> I almost don't dare to ask ...
>
> Using the insanely old mon provided by Debian (0.99.2-12) I get errors in my syslog when I use more than one dependency on a depend line, e.g.:
>
> ---CUT---
> watch foo
>     service bar
>         depend webservers1:ldap gateway:ping
>         [...]
> ---CUT---
>
> Syslog then shows this:
>
> ---CUT---
> eval error for dependency starting at webservers1:ldap gateway:ping
> ---CUT---
>
> So if anybody has an idea how to deal with that, I would be very grateful (even if only updating solves it :-)
>
> Thanks!
>
> --
> Udo Rader
> http://www.bestsolution.at
Re: mon.cgi very slow, communication protocol improvements?
Rune,

You didn't mention some important bits of information, most significantly what versions of Mon, Mon::Client and mon.cgi you are using. There have been significant protocol changes between versions. Speed problems that occurred with 0.99.2 are pretty much gone with 1.2.0, for example. Also, what OS are you running on?

I'm using mon with well over 100 hostgroups without any performance problems, with mon.cgi typically rendering a full page in under a second. I can't see how the performance would fail to scale to 600.

Off the top of my head, I'm guessing that Storable would actually increase the overhead in the mon server and the CGI, as the data still has to be transformed into the sharable form and then re-parsed.

-David

On Thu, Apr 3, 2008 at 4:09 AM, Rune Kristian Viken [EMAIL PROTECTED] wrote:
> I'm using mon to monitor 600 hostgroups, with an average of 8 or so services each. The total number of hosts is 1000.
>
> The main problem I've come across is that mon.cgi is slow, and after some debugging, it seems that it's the communication with the mon server that is slow. I have to wait an average of about 12 seconds per pageview.
>
> I've tried digging around a bit, and it seems that two routines in query_opstatus take quite a long time:
>
>     %op_success = mon_list_successes;
>     %op_failure = mon_list_failures ;
>
> Looking at the communication protocol, it seems that the main drawback is that mon has to spin through a *lot* of data structures and present them in a nice way.
>
> I was thinking that this might be accomplished faster by sharing the %watch and maybe %groups data structures from mon, with the help of http://perldoc.perl.org/Storable.html .. but even though I feel I have decent know-how of mon internals, I don't feel it's entirely up to scratch on how to implement this.
>
> Is it a good idea? A horrible idea? Am I barking up the wrong tree, with something else being the main problem here?
>
> --
> Rune Kristian Viken
Re: avoid duplicated alerts in a multi-host/mon context
On 10/17/07, Jacques Klein [EMAIL PROTECTED] wrote:
> Well, not really, or not enough in fact. If I understand the depend, it's a way to avoid multiple alerts by specifying dependencies between services in ONE mon. If I take this concept, then it would have to be extended to dependencies between services in a GROUP of mon(s) (one per host), interesting but seems very complicated.

If you configure each of your mon servers to send traps to all of the others on status updates, then you can use dependencies on each server based on state changes from the other servers.

If they're all on one LAN you could probably even do that by sending the status updates as broadcast packets. I've never tried that, and it might take minor coding in Mon to make it process broadcast packets. Multicast would be even better, but that would definitely require some code changes.

The best way to cause all status updates to get propagated is by using the 'redistribute' config option. From the manual:

    redistribute alert [arg...]
        A service may have one redistribute option, which is a special form of an alert definition. This alert will be called on every service status update, even sequential success status updates. This can be used to integrate Mon with another monitoring system, or to link together multiple Mon servers via an alert script that generates Mon traps. See the ALERT PROGRAMS section above for a list of the parameters mon will pass automatically to alert programs.

Combine redistribute with trap.alert, define all your watches and services on all servers, and then you can do lots of stuff with dependencies.

-David
Re: Disable all alerting for 20 minutes
--On Thursday, December 14, 2006 00:20:40 +1030 Ben Ragg [EMAIL PROTECTED] wrote:
> Hi there,
>
> We often make changes to our network at 3am, and while every effort is made to disable the appropriate services, quite often something will slip through the cracks and wake someone up.
>
> Is there an option to disable all alerts from being sent for 20 minutes, and only display via the webpage (Failed, NoAlerts)?

There are a few options right now for this.

If it's a regular occurrence, you could configure an exclude period on the services, or configure the alert periods themselves to exclude that time frame.

If it's an irregular occurrence, you can stop the mon scheduler via the web interface (or from cron), and restart it when done. (The UI will see no updates, because nothing will be tested...)

Finally, the most evil hack-style method, which I've used on occasion, is:

    cd mon-alert-dir
    chmod -x *
    ... maintenance here
    chmod +x *

You could also write a script that uses Mon::Client and disables all hostgroups. (This would show the status updates in the UI without sending alerts. At least it would with the current (CVS, 1.2.0rc1) Mon; I can't remember whether 0.99.2 did that.)

-David
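The chmod hack described in this reply can be wrapped so that the alert scripts are always made executable again, even when the maintenance command fails. This is a rough sketch rather than anything shipped with mon: the function name with_alerts_disabled and the directory argument are hypothetical, and like the original hack it marks every file in the directory executable on the way out.

```shell
# Run a command while the alert scripts in a directory are not executable,
# then restore the execute bit and pass the command's exit status through.
# Usage (hypothetical paths): with_alerts_disabled /path/to/alert.d ./do-maintenance
with_alerts_disabled() {
    wad_dir=$1
    shift
    # Strip the execute bit so the alert scripts cannot be run.
    chmod -x "$wad_dir"/* || return 1
    "$@"
    wad_status=$?
    # Restore the execute bit whether or not the command succeeded.
    chmod +x "$wad_dir"/*
    return "$wad_status"
}
```

A signal that kills the shell between the two chmod calls would still leave the alerts disabled, so a `trap` handler is worth adding for anything run unattended.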
Re: Handwritten sms.alert doesn't get executed
I sent this earlier, but I must have sent it from the wrong address and it's sitting in a moderation queue...

-David

-- Forwarded message --
From: David Nolan [EMAIL PROTECTED]
Date: Dec 1, 2006 8:27 AM
Subject: Re: Handwritten sms.alert doesn't get executed
To: mon@linux.kernel.org

On 12/1/06, Steven Schubiger [EMAIL PROTECTED] wrote:
> Hi!
>
> I've been trying for quite a while to get a handwritten SMS alert executed by mon. Everything is fine if I open a terminal and run the script with the same parameters as defined in mon.cf: the SMS gets sent. When run by mon, nothing happens.
>
> I looked through the mailing list archive and found some similar threads which had some interesting remarks. I checked that:
>
> * permissions are right (same as for all other alerts)
> * the interpreter line is valid (same as for all other alerts)
> * no absolute path is specified (same as for most other alerts)
> * perl -c sms.alert emits no warnings (same as for all other alerts)
>
> Furthermore, I checked whether the script runs, but obviously it doesn't. I've examined the syslog and the output generated from mon when called with the debugging flag, but they leave me in a rather clueless state.
>
> Thanks in advance,
> Steven

Steven,

Some suggestions for you:

You said "no absolute path specified"; did you mean no non-absolute paths specified? I.e. if your script runs a program named foobar from /usr/local/bin, it should be calling it as /usr/local/bin/foobar, not assuming /usr/local/bin is in $PATH. (Alternatively you can set $PATH in your script...)

Try adding some debugging code to your script. I.e. if it's Perl, add something like:

    open(LOG, ">>/tmp/alertlog");
    ...
    print LOG "got to step XXX\n";
    ...
    print LOG "got to step YYY\n";

When testing your alert, are you also passing the other options that Mon sets when calling an alert? I.e. '-g group -s service -h list of hosts here', etc. (See the Mon man page for full documentation.)

What user do you run mon as? Have you tested su'ing to that user and running the script?

Can you post a copy of your script for us to look at? (Without any SMS numbers, of course...)

-David
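One way to act on the $PATH suggestion above is to run the alert by hand with an almost empty environment, so that anything the script silently inherits from a login shell goes missing. This is a sketch under assumptions: the environment mon actually passes to alerts is not described in this thread, and the helper name run_with_bare_env and the PATH value are hypothetical.

```shell
# Run a command with nothing in its environment except a minimal PATH.
# This only approximates a daemon's environment; it is not what mon does.
run_with_bare_env() {
    env -i PATH=/usr/bin:/bin "$@"
}

# Example invocation (hypothetical path; the -g/-s/-h flags are the ones
# mentioned in the reply above):
#   run_with_bare_env /path/to/alert.d/sms.alert -g group -s service -h host1
```

If the alert works from a normal terminal but fails under this wrapper, a relative program path or a missing environment variable is the likely culprit.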
Re: Questions about snmpmonitoring and ALERT / UPALERT
--On October 13, 2006 9:41:52 AM -0400 Bill Chmura [EMAIL PROTECTED] wrote:
> Hello,
>
> Yesterday I installed two temperature sensors in my server room. I set them both for 10 degrees higher than the current temperature. Well, the building people raise the temperature at night to save on energy. I do have my own cooling system in there, but it did not compensate for the building raising the temperature, and that set off the alarms. My threshold was 75 degrees and the peak it went up to was 76. Unfortunately it paged me around 75 times last night.

Ah, I believe you've just learned the first lesson of monitoring... Never enable paging on a new test/service until you've run the monitoring test for a while first.

> So it basically went like this:
>
> ALERT (temp 75.7)
> UPALERT (temp 75.3)
> ALERT (temp 75.4)
> UPALERT (temp 75.6)
> etc, etc...
>
> All of these are above the stated MAX limit of 75. For some reason, every other one is coming as good news, even though the temperature could have gone up.
>
> I am going to spend part of today ensuring I can sleep tonight (first by raising the MAX temp) by solving this, but if anyone has any thoughts on this I would love to hear them.

I have a suspicion of what's going on here. I believe the current mon version has a feature (or bug, depending on your point of view) where the UPALERT summary and detail messages are actually from the last failure, not from the OK test. I suspect the temperature was actually crossing the threshold repeatedly, something like this:

    test 1: 75.7 - ALERT 75.7
    test 2: 75.3 (no alert)
    test 3: 75   - UPALERT 75.3
    etc...

There has been debate in the past about whether providing the 'last failure' content is useful for indicating which failure ended, or is confusing because it looks like it's saying that state is OK. I feel it's confusing, and at CMU we're running with a patched mon that provides the success output during an upalert. I can't remember right now whether a decision was made about changing this behavior. If we decided to change it, the change must have gotten missed during one of the big merges between Jim's alert structure rewrites and my behavior changes.

So, the messages you got were confusing, but the temperature was probably crossing your threshold repeatedly. You might want to experiment with requiring more failures before you alert, i.e. 'alertafter 3'. Or you could de-bounce the monitor test somehow. Maybe configure it with two values, a low-water mark and a high-water mark, and exit with different exit codes. E.g. use exit code 1 when the temperature is 75-78, and exit code 2 when the temperature is over 78. Then you could send only email for temperatures in the 75-78 range, and page on temps over 78.

-David
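The two-threshold idea at the end of this reply can be sketched as a helper that turns a reading into an exit code. It is an illustration only: the function name temp_status is hypothetical, the defaults of 75 and 78 come from the example in the reply, and treating a reading of exactly 75 as OK is an assumption based on the "test 3: 75 - UPALERT" line. How mon is configured to act differently on exit codes 1 and 2 is not shown in this thread.

```shell
# Map a temperature reading to an exit status:
#   0 = at or below the low-water mark (OK)
#   1 = above the low-water mark, up to the high-water mark (email range)
#   2 = above the high-water mark (paging range)
# awk is used because readings such as 75.7 are not integers.
temp_status() {
    awk -v t="$1" -v low="${2:-75}" -v high="${3:-78}" 'BEGIN {
        if (t + 0 > high + 0) exit 2
        if (t + 0 > low + 0)  exit 1
        exit 0
    }'
}
```

A monitor script would read the sensor, print the value for the alert summary, and finish with `temp_status "$reading"` so that its exit code carries the severity.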
Re: SNMP monitoring
On 9/15/06, Bill Chmura [EMAIL PROTECTED] wrote:
> Hey,
>
> I've been muddling my way through getting SNMP working with Mon, and I am happy to report that I have had more trouble finding the right MIBs than I have had getting Mon to work with them. Good job!
>
> Some thoughts after going through this process. (I am running CVS from last week sometime.) This is all regarding snmpvar.monitor.
>
> * The contrib directory has up to 1.4 in it, but Mon comes with 1.6. Should something be noted in the contrib area that it's being maintained in the main distribution?

The contrib directory on www.kernel.org is actually a bit out of date; the mon-contrib area in CVS has several newer scripts. Since snmpvar.monitor has been integrated into the primary distribution, I've removed it from the CVS contrib area (just now...).

> * The readme for it refers to having UCD SNMP installed. I found that in late 2000 it changed its name to NET-SNMP. It still works fine, but NET-SNMP is easier to find in package management than UCD SNMP. Should it be changed?

I just committed a fix to the readme.

> * The readme also instructs you to copy snmpvar.def and snmpvar.cf to your mon etc directory. These are not in the main mon package. I found them in the contrib tgz for the last snmpvar and used those. They worked fine, but the directions should probably be updated or the files included.

The files are included in the mon package, in the etc directory.

> Anyway, I would be more than happy to put it all together and send someone updates they could drop into CVS. I'd love to contribute something back to the project.

Feedback like this is already a useful contribution. Not every contribution has to come in the form of code updates... :)

-David
Re: New release?
On 9/7/06, Bill Chmura [EMAIL PROTECTED] wrote:
> If someone wants to update the tag on the mon-client so the new stuff that fixes mon.cgi is in, I would be more than happy to roll a few tarballs so there could be a new release.

I'd actually already moved the tag, but was waiting for Jim to put a release out. However, since he hasn't gotten to it and there is clearly demand for it, I'll at least publish a release candidate.

I've placed mon and mon-client 1.2.0-RC1 files here for review:
http://www.managedandmonitored.net/mon/

-David
Re: rotate Downtime log
No, Mon does not currently support this format.

Using logrotate is probably an OK approach, but I suspect that you would need to restart Mon to get it to close the file and create a new one. (I haven't confirmed that, but I don't think it re-opens the file every time...)

A better answer would be to add log rotation support to Mon itself, so that at rotation time it doesn't lose all knowledge of past failures.

-David

On 9/4/06, pingouin osmolateur [EMAIL PROTECTED] wrote:
> Hi everybody,
>
> Can I use this format to rotate the downtime log, or something equivalent?
>
> logdir = /var/log/mon%YEAR-%MONTH
>
> Or is there another solution? I know I can use logrotate.
>
> Thanks in advance
Re: Alert After not working for me...
On 9/8/06, Bill Chmura [EMAIL PROTECTED] wrote:
> I've recently spent a lot of time overhauling my mon.cf. I moved to m4 macros, which I had been meaning to try (I recommend them to anyone who has not tried them for mon.cf).

(Note to self: I really need to put together a public release of the system we use at CMU for maintaining our mon config file. It's a complete database-driven web app for maintaining a large mon config...)

> Basically, I was thinking for a few services that are touchy to have the system regularly test every 30 minutes. But if it has a failure, to test every minute. Then issue an alert if it fails 5 times in one minute.

Is that a typo? How can it fail 5 times in one minute if you're only testing once every minute?

Since you didn't include a mon.cf snippet, I'll have to guess a bit about what's going on. I suspect you're trying to describe something like:

    ...
    interval 30m
    failure_interval 10s
    period
        alertafter 5 1m

I think you're trying to use the two-argument form of alertafter in a way other than intended. The two-argument form is for detecting intermittent failures, i.e. 'alertafter 2 6h' would alert if a service fails twice in six hours. In the case of an intermittent failure, a single failure would only result in two tests at the faster test rate before returning to the regular test rate.

For what you're describing I think you want either 'alertafter 5' (i.e. 5 consecutive failures) or 'alertafter 50s' (i.e. alert when a service has failed every test for 50 seconds).

-David
Re: A question about mon.cgi and monshow
On 9/5/06, Bill Chmura [EMAIL PROTECTED] wrote:
> I recently installed the latest CVS (1-2-0) of both the monitor and the client.
>
> MON.CGI
> ---
> I put mon.cgi in my web server, but when I run it, basically it spits out into the logs:
>
> Can't locate object method "list_views" via package "Mon::Client" at ./mon.cgi line 2175, <GEN0> line 1.

I would swear I committed that code to CVS at one point, but it's not there. I just committed changes to Mon::Client to provide the view-related methods. (This is a relatively new feature in mon where filtered client views are implemented in the server, rather than having to be implemented in every client.)

> I also have a problem with fping and unidentified output, but we talked about this before so I am going to go stfw and the archives on that one.

I may have a different version of fping.monitor that has the code necessary to handle this output. Can you tell me exactly what extra output from fping you're seeing?

-David
Re: CVS Access broke?
--On Thursday, August 31, 2006 12:16:25 -0400 Jim Trocki [EMAIL PROTECTED] wrote:
> On Thu, 31 Aug 2006, Bill Chmura wrote:
>> Which version is recommended at this point?
>
> this should do you well:
>
> ftp://ftp.kernel.org/pub/software/admin/mon/devel
> mon-1.1.0pre1.tar.gz
> mon-client-1.0.0pre2.tar.gz

He really should be using at least mon-1-1-0pre3. There were a couple of significant bugs in pre1, and there have been a couple of minor fixes since then.

Jim, if I tag the current code as mon-1-1-0pre4 and mon-client-1-1-0pre3, can you put up tarballs of both of those, and maybe of mon-contrib as well? If you don't have the time I can put up images somewhere else.

-David
Re: CVS Access broke?
On 9/1/06, Jim Trocki [EMAIL PROTECTED] wrote:
> ok sure, and i guess we should just fork it and call the branch 1.2, or 2.0. the head trunk we can begin calling 1.3 or 2.1, following the odd #s devel, even #s stable paradigm. i can take care of that and the updates to the web page and other related stuff sometime within the next week.

OK, following that convention I've just tagged the current CVS as mon-1-2-0 and mon-client-1-2-0 respectively. I haven't created a branch from that tag yet, but we can do so if you want. (If we're just going to be doing minor bug fixes for a while there probably is no need to branch just yet.)

Please create tarballs and publish to both ftp.kernel.org and the sourceforge files area.

-David
RE: Question on Redistribute
--On Thursday, August 24, 2006 14:54:05 -0500 Tim Carr [EMAIL PROTECTED] wrote:
> - Would it be possible to just send one "everything is ok" trap for a new overall check? Maybe a new monitor script that queries itself to see if there are any existing problems and will alert based off that?
> - I'd also continue to send an alert per service if a new service problem is detected.
> - On the corporate server, I'd set up only one service per store entry that would have the traptimeout monitor (to watch for the network outages) but still have a service entry for each server to catch any of the specific service outage traps that would be received.

One scenario I can envision that would work, which may be what you're trying to describe here, is:

- Services at remote sites are monitored at the desired frequency (10s), with traps sent to corporate via alert/upalert, i.e. only during failures.
- On the real services, do not configure an alertevery option, so traps are resent every 10 seconds in case the UDP packet is dropped.
- You would probably also want a startupalert configured here to set the initial status to OK on the corporate server.
- Add one fake service that always returns an OK result, run it once per minute and redistribute the status to corporate. For this service only, you would want traptimeout configured at corporate.
- Possibly add monitoring of the remote sites from the corporate server, including monitoring of Mon itself.

-David
Re: Flapping monitor
--On Thursday, August 24, 2006 13:58:49 +0200 Emilio Mira Alfaro [EMAIL PROTECTED] wrote:
> Hi list,
>
> I'm trying to configure mon to alert when one of our routers' interfaces flaps 3 times within 30 secs. I would also like mon not to send more than 1 alert every 30 minutes. I came up with this config:
>
> watch mad_log_flapping
>     service path_a
>         description flapping on path_a
>         period wd {Mon-Sun}
>             alertafter 3 30s
>             alertevery 30m
>             #trapduration 30s
>             alert mail.alert email_address
>
> I'm redirecting SNMP traps from the router to mon using snmptrap2mon.pl. The thing is that if I redirect linkUp and linkDown traps, the service never comes down and mon never sends an alert, even when there are more than 2 transitions (linkDown, linkUp). If only linkDown traps are redirected, mon sends the alerts as it should, but the service is always down (it shows up in red on mon.cgi) after a flapping occurs, which bothers me. This is mainly because no linkUp traps are redirected.
>
> I've tried the option trapduration 30s, but on the latest CVS release mon complains with: unknown syntax [trapduration 30s], line 59.

trapduration is a configuration option that belongs in a service definition, but outside of a period definition. The current Mon code is much stricter about misplaced options, where earlier versions would just ignore them.

> I'd like to have the service on green while there is no flapping and, if there is flapping (3 interface transitions within 30 secs), put it on red for 30 min and bring it back to green if no more flapping happens.

trapduration won't quite get you this behavior. When the trap status expires, the service will go back to the untested state, not the OK state. This may be acceptable to you...

To get the behavior you describe, you need to make snmptrap2mon.pl send a mon trap with status OK for linkUp traps. I suspect it's currently sending failure alerts for both linkDown and linkUp traps.

-David
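The change suggested for snmptrap2mon.pl comes down to choosing the mon status from the SNMP trap type instead of always reporting a failure. The mapping can be sketched as follows; the internals of snmptrap2mon.pl are not shown in this thread, so the function name and the literal strings ok and fail are hypothetical placeholders for whatever the script passes when it builds the mon trap.

```shell
# Pick the status to report to mon based on the SNMP trap type.
# "ok" and "fail" are placeholder strings, not confirmed mon keywords.
mon_status_for_trap() {
    case "$1" in
        linkUp)   echo ok ;;
        linkDown) echo fail ;;
        *)        echo unknown; return 1 ;;
    esac
}
```

With this mapping each linkDown counts as a failure towards 'alertafter 3 30s', and the following linkUp returns the service to OK instead of leaving it red.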
Re: Question on Redistribute
--On Thursday, August 24, 2006 10:14:48 -0400 Jim Trocki [EMAIL PROTECTED] wrote:
> On Thu, 24 Aug 2006, David Nolan wrote:
>> --On Thursday, August 24, 2006 08:21:16 -0500 Tim Carr [EMAIL PROTECTED]
>>> The problem is that we're going to need to turn the monitoring period for several of the remote site monitors in each location way up, like checking every 10 seconds (i.e., interval 10s). That means we're going to see a huge increase in the number of traps we're seeing at the corporate site.
>>
>> Or we could implement a redistributeevery option, similar to alertevery. That wouldn't be too hard, but would take a little work.
>
> yeah the issue here is the processing and communication overhead of dealing with the traps sent remotely. it would make sense to batch up the 10s traps from the remote systems and send them out in a bundle say, once every minute, and that would, you know, save you 6x the processing overhead on the remote mon server, or at least give you a way to control the processing overhead to suit your needs.
>
> this use case might mean that it would make sense to move the remote trap stuff into the mon server itself, rather than implement it with the trap alert. the trap alert is a nice simple abstraction that works well for the simpler cases, and an elegant way of extending the functionality of mon without having to change the server code, but at the cost of efficiency. you would really want the ability to batch up only the trap transmissions rather than all alerts. for example, schedule a trap queue flush every minute performed by the mon server rather than in the trap alert.

I could see benefits to that capability, in addition to the current redistribute support. My original idea for redistribute was that it could be used to integrate mon with other systems as well, because it's just an arbitrary script that you can provide. I.e. it could send status updates to OpenView, or log status updates to a database, or anything else you might want. The ability to use it for integration with remote mon servers is just a bonus...

> then this brings up the issue of trap processing overhead on the rx end. i wonder if the behavior would be acceptable by just processing the trap receptions serially, the way it is done now, or if it would require a change in processing method to scale it up efficiently.

For the record, my master server is a 2.8GHz P4, and basically runs at zero load while processing the trap load I described earlier and running a few tests of its own. I'm sure there is a limit to reasonable trap load, but we haven't hit it yet.

> this probably requires much more thought and a better understanding of the usage scenario.

I agree. I suspect Tim's usage scenario involves large numbers of servers, each monitoring a relatively small environment, so I doubt he'll have any processing load problem. But we're not quite sure of the scale of Tim's setup.

-David
RE: Question on Redistribute
--On Thursday, August 24, 2006 10:18:56 -0500 Tim Carr [EMAIL PROTECTED] wrote:
> 4000 traps/second. That sounds like a whole lot to me.

Holy cr**, that's a lot of traps. Wow, the interesting ways that mon gets deployed continue to amaze me...

Even if you were only sending one trap per minute per service you would have:

    25 services * 1 trap/minute * 2 servers * 1200 sites = 60000 traps/minute, or 1000 traps per second.

That's still a *lot* of traps. Doing your bandwidth math shows that it's still 1.6Mbps of trap traffic.

I think you might want to make your mon setup more structured, with intermediate collection points that pass only status changes to your final collection point.

-David
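The arithmetic in this reply is easy to redo for other intervals. The sketch below uses the counts quoted above (25 services, 2 servers, 1200 sites); the function names are hypothetical, and the 200-byte trap size is not stated in the thread but back-computed from 1.6Mbps at 1000 traps per second.

```shell
# Traps per second for: services, servers per site, sites, and the
# interval in seconds between traps for each service.
traps_per_second() {
    echo $(( $1 * $2 * $3 / $4 ))
}

# Bandwidth in bits per second for: traps per second, bytes per trap.
# The 200-byte figure used below is inferred, not measured.
trap_bits_per_second() {
    echo $(( $1 * $2 * 8 ))
}

traps_per_second 25 2 1200 60      # one trap per service per minute: 1000
trap_bits_per_second 1000 200      # 1600000 bits/s, i.e. 1.6Mbps
```

With the same counts, `traps_per_second 25 2 1200 15` gives 4000, so the quoted 4000 traps/second is consistent with a 15-second interval, although the thread does not state which interval was used.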
RE: Problem getting traps to work correctly
--On Thursday, July 13, 2006 14:01:58 -0500 Tim Carr [EMAIL PROTECTED] wrote: A question on the redistribute option, though - I'm not sure I can follow how the configuration works. For example, my current remote server config is:

redistribute is a service level config option, not a period option. For example:

    watch Store13-2
        service DRBD_Status
            interval 15s
            monitor DRBDCheck.monitor -s you
            description Is\ DRBD\ working\ there?
            redistribute trap.alert mainmonitor

-David
RE: Problem getting traps to work correctly
--On Thursday, July 13, 2006 14:20:38 -0500 Tim Carr [EMAIL PROTECTED] wrote: Gotcha. I threw that in, and it seems to work correctly, except I can't tell if it is or not. I'm watching the log file, and it shows alerts being sent on an up/down event, but I'm not seeing alerts every 15s showing up when things are working correctly. Is that expected behavior? Thanks, Tim I refer to the server that sends the traps as a slave server, and the server collecting the traps as the master server. Your master server should receive a trap on every status update on the slave server, i.e. a trap every 15s in your example. The master should only alert based on its own alert behavior. This makes receiving updates via traps almost functionally equivalent to other monitor tests that you run on your master server. If that's not the behavior you're seeing please let me know. -David
Re: Problem getting traps to work correctly
On 7/13/06, Tim Carr [EMAIL PROTECTED] wrote: Here's a bit more information on it. I've got the slave server configured for multiple services, each of them using the redistribute option:

    redistribute alert trap.alert mainmonitor

If that's an exact quote you've got the option wrong. It's just redistribute trap.alert mainmonitor.

On the master server, once I've reset it, none of those servers will ever go green/good in mon.cgi - they stay in blue/unchecked status.

That sounds like you've still got the period based trap configuration in place. (Which would match with the above typo.) If that's not true, and the line above was a typo in the email, not the configuration, then maybe the redistribute code in CVS is broken. Before I go investigate that possibility please confirm whether the line above was an exact quote from your config file.

In the slave server, the history file shows this for an outage event:

    alert Store13-2 DRBD_Status 1152819579 /opt/mon/alert.d/trap.alert (mainmonitor) DRBD_Not_Running
    upalert Store13-2 DRBD_Status 1152819594 /opt/mon/alert.d/trap.alert (mainmonitor) DRBD_Not_Running

This also indicates to me that your old alert/upalert configuration is still in place, because redistribute does not generate history entries; doing so would bloat the history file on the slave server. -David
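A quick way to check a config file for the "redistribute alert ..." slip discussed above is a grep. This is a rough heuristic, not part of mon: the function name is invented here, and it would also flag the unlikely case of an alert program that is literally named "alert".

```shell
#!/bin/sh
# find_bad_redistribute CONFIG_FILE
# Prints, with line numbers, every "redistribute alert ..." line.
# Exit status is 0 if at least one such line was found.
find_bad_redistribute() {
    grep -nE '^[[:space:]]*redistribute[[:space:]]+alert[[:space:]]' "$1"
}
```

Run it against the slave server's mon.cf; no output means the option is at least spelled the way the reply above describes.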
Re: mon-alert-script won't execute
--On Wednesday, July 05, 2006 13:46:44 +0200 Felix Leiter [EMAIL PROTECTED] wrote: mon recognizes at 03:29:37 03:29:47 that the port 8080 is closed and calls the squid.alert at 03:29:47 but then nothing happens. I don't know where the misconfiguration is. I try to change the squid.alert-script to this: # !/bin/sh # # /etc/init.d/squid start the acl for squid.alert is set to 755, this should also be alright. does anyone have any suggestions? kind regards Felix, What user is mon running as? If it's not running as root it probably cannot restart squid. What OS are you running? Is the squid init script refusing to start squid because it thinks it's already running? Have you tried running your script by hand and seeing whether that restarts squid? In your message the first line of the script is '# !/bin/sh', while it should be '#!/bin/sh', i.e. no space between # and !. But that might have just been a typo in your email... -David
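The shebang point above is easy to test mechanically. The helper below is a hypothetical sketch (the name is not from mon): it succeeds only when a file's first line starts with exactly "#!", so a line such as "# !/bin/sh" is reported as broken.

```shell
#!/bin/sh
# has_valid_shebang FILE
# Succeeds only if the first line of FILE begins with "#!".
# "# !/bin/sh" (with a space) is just a comment, not an interpreter line.
has_valid_shebang() {
    first_line=
    IFS= read -r first_line < "$1" || :
    case $first_line in
        '#!'*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

It only covers the last suggestion in the reply; the questions about which user mon runs as and what happens when the script is run by hand still need answering separately.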
Re: pager alert continues to page multiple times after single failure
--On Thursday, April 20, 2006 09:10:53 -0400 Brendan Mullen [EMAIL PROTECTED] wrote: I was migrating our instance of Mon to a new machine running a newer version. The problem was traced to a locally modified version of the qpage alert. If I had used the qpage.alert that shipped with the version of Mon I was using, I would have been fine. The locally modified qpage.alert worked on an older version of Mon, but not 1.0pre5 The page would be sent but never show up in the alert history, and then would be sent again. and again... Interesting. Can you explain the cause of the failure? Was qpage.alert exiting with an error code that made Mon think it need to re-try the alert? -David Nolan Network Software Designer Computing Services Carnegie Mellon University ___ mon mailing list mon@linux.kernel.org http://linux.kernel.org/mailman/listinfo/mon
Re: pager alert continues to page multiple times after single failure
--On Wednesday, April 19, 2006 12:09:30 -0400 Brendan Mullen [EMAIL PROTECTED] wrote: Hello, I'm using mon-1.0.0pre5, the mon-client-1.0.0pre5 and mon.cgi2.2 When an alert is triggered, it continues to page every minute like it is paging on every watch interval. The config looks fine. Is your monitor's summary line changing? (That would cause a re-alert). -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list mon@linux.kernel.org http://linux.kernel.org/mailman/listinfo/mon
Re: monitoring parameters
--On Wednesday, February 22, 2006 16:46:59 -0600 Nate Reed [EMAIL PROTECTED] wrote: I'm not sure if I have set the monitoring parameters correctly for what I want to do. First question: is the monitoring interval the frequency that mon runs the monitor, or does it define something else? I hate to quote the documentation, but from the manual: interval timeval The keyword interval followed by a time value specifies the frequency that a monitor script will be triggered. So 'interval 30s' means that mon will run the monitor test every 30 seconds. It seems like MON is forgetting about the previous alert after the monitoring interval has elapsed (MON_FIRST_FAILURE and MON_LAST_FAILURE are equal even though there were numerous failures). Is that what's supposed to happen? First and last failure should be the same in certain cases, depending on how long the failure has been happening. first failure is an indication of when the current failure started, last failure is an indication of when the most recent monitor test was run. So if your interval is 5 minutes, for the five minutes immediately following the first detection of a failure first and last will be the same. Ideally, my monitor would run very frequently (every few seconds), but the monitoring interval would be longer, like 30 minutes. Upon on a second failure during the monitoring interval, my alert script will try to take a different action than on the first failure. Is this possible through Mon's configuration (without building this logic in my script)? You can do this. The interval setting configures the testing behavior, the alert period definitions configure the alerts (actions) that will occur. You can have multiple periods with different behaviors for different failure lengths or different times of day. 
For example, look at these two periods:

    period first_action: wd{Sun-Sat}
        alertafter 1
        alert some.alert.script -some -arguments
        numalerts 1
    period second_action: wd{Sun-Sat}
        alertafter 30m
        alert some.other.alert.script -some -arguments
        alertevery 30m

Those would run some.alert.script immediately whenever a failure occurs, and some.other.alert.script after the failure has been continuous for half an hour and every half hour after that. See the manual for full information on all the alert control semantics that are available. -David Nolan Network Software Designer Computing Services Carnegie Mellon University
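On the earlier question about MON_FIRST_FAILURE and MON_LAST_FAILURE being equal, a small worked example may help. It assumes both variables hold epoch seconds (that is how I read the man page, so treat it as an assumption), and the function name and sample timestamps are purely illustrative.

```shell
#!/bin/sh
# failure_age FIRST LAST
# Seconds between the first detection of the current failure and the
# most recent failed test. Assumes both arguments are epoch seconds.
failure_age() {
    echo $(( $2 - $1 ))
}

# In an alert script this would be called roughly as:
#   failure_age "$MON_FIRST_FAILURE" "$MON_LAST_FAILURE"
failure_age 1152819579 1152819579   # at first detection: prints 0
failure_age 1152819579 1152819879   # one 5-minute interval later: prints 300
```

With a 5-minute interval the two values stay identical until the next test runs, which is the behaviour described in the message.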
Re: capture ip of falling node
--On Tuesday, November 29, 2005 10:22:33 +0100 Andrés Cañada [EMAIL PROTECTED] wrote: Hi! I posted this message in another list and I think this is the right place for it. Andrés Cañada wrote: Hi all! I have a cluster working with heartbeat and ldirectord, systemimager, ganglia and mon. It's working nice already (thanks to everybody in this list!). I use Mon to monitor the cluster nodes with snmpd. When one of the criteria is positive, then Mon sends me an alert to my mail. That's great!! But now I'd like to be able to capture that sign sended by Mon to run a script. I don't know if I'm explaining well. When ,in example, a node fails to a ping-check, I'll receive an e-mail notification, but I'd like also to be able to capture this signal to run a script. Can anybody tell me if snmptrapd is ideal for this issue to solve? Is there a HOWTO for this? thank you very much and sorry for my english. Andres Why don't you just write your own alert script for Mon and have Mon run it? Thanks for your answer. I'd like it to be so easy but I'm afraid it isn't. In my case I need to know the ip of the falling node and then trigger a script that makes something with the rest of the nodes (I need to modify the setup a mpi universe). It seems to be possible to do this since the ip of the falling node is received via mail. Any ideas? Should I need to use snmptrapd?? Thank you very much Andrés. I think you missed the point of the previous response. The mail you're getting from Mon is being generated by an alert script. If you think there is enough information in that mail to take an action on the failing node, then there is enough information available to an alert script to just have the script take the action. Alert scripts in Mon are simply programs which take information in specific ways from Mon and perform some form of action on that data. They can be shell scripts, perl programs, C programs, etc. They can take whatever action you deem reasonable. 
The command line arguments and environment variables that are passed to alert scripts are documented in the Mon man page. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list mon@linux.kernel.org http://linux.kernel.org/mailman/listinfo/mon
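As a rough illustration of "just have the alert script take the action", here is a sketch of the argument handling such a script might do. The option letters (-s service, -g group, -h hosts, -t time, -u) are my recollection of the Mon man page and should be checked against your version; the function and variable names are invented for this example.

```shell
#!/bin/sh
# parse_alert_args "$@"
# Sets ALERT_SERVICE, ALERT_GROUP and ALERT_HOSTS from the options an
# alert is assumed to receive (-s service, -g group, -h "host host ...").
# Adjust the option string to match the man page for your mon version.
parse_alert_args() {
    ALERT_SERVICE= ALERT_GROUP= ALERT_HOSTS=
    OPTIND=1
    while getopts ':s:g:h:t:u' opt; do
        case $opt in
            s) ALERT_SERVICE=$OPTARG ;;
            g) ALERT_GROUP=$OPTARG ;;
            h) ALERT_HOSTS=$OPTARG ;;
            *) : ;;   # ignore everything else in this sketch
        esac
    done
}

# for_each_failed_host COMMAND...
# Runs COMMAND once per failed host, with the host appended as last argument.
for_each_failed_host() {
    for host in $ALERT_HOSTS; do
        "$@" "$host"
    done
}
```

A real alert would end with something like parse_alert_args "$@" followed by for_each_failed_host and whatever site-specific command reconfigures the remaining nodes; no snmptrapd is needed for that.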
Re: alert does not execute script compleet
--On Saturday, November 26, 2005 4:51 PM +0100 gandalf istari [EMAIL PROTECTED] wrote: Hi, I have a problem with a self written alert. This script must change two routing tables, exec a ssh command and write ta text to a log file. It does everything execept the ssh command. If i run the alert manualy all work perfectly. this script is crutial in our failover setup. snip # Change route at UCC to framerelay ssh [EMAIL PROTECTED] /usr/lib/mon/alert.d/use-framerelay.alert If I were a betting man, I'd bet money that ssh isn't in the PATH of the user that Mon is running as. Try adding the full path to your ssh binary. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list mon@linux.kernel.org http://linux.kernel.org/mailman/listinfo/mon
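One way to follow the "full path" advice above without guessing where ssh lives is to look it up in a fixed list of directories and fail loudly if it is missing. This is a sketch only; the helper name and the directory list are assumptions, not anything mon provides.

```shell
#!/bin/sh
# find_in_dirs NAME DIR...
# Prints the first DIR/NAME that is an executable file; fails if none is.
find_in_dirs() {
    name=$1; shift
    for dir in "$@"; do
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}

# In the alert itself, resolve once and then call by absolute path:
#   SSH=$(find_in_dirs ssh /usr/bin /usr/local/bin /bin) || exit 1
#   "$SSH" user@remote-host /usr/lib/mon/alert.d/use-framerelay.alert
```

Because the lookup does not depend on PATH, the script behaves the same whether it is run by hand or started by mon.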
Re: Failure do not become Alert !
--On Friday, November 11, 2005 10:47:39 +0100 GioiaBa [EMAIL PROTECTED] wrote: SNIP period wd {Sat-Sund} SNIP last day, the Router service went down, so 'Ext' watching began to fail.. the problem is that the failure leght has been 1h 39 mintues !! and never became Alert.. so no Alert has been sent for that hour.. this would be a great problem, as the service we are monitoring is our Router connectivity.. any ideas on the reason why this could happen ? Was the failure on a Saturday or Sunday? Your period definition is for weekends only. Perhaps you want 'period wd {Sun-Sat}'. Or simply an empty period definition will match always. ..and we also need to monitor the responding time of the service.. I mean the service 'fails' only if the fping did not respond in xxminutes .. I've read before how to do it, but I can't find it right now.. Any help would be appreciated.. thank you very much I'm not sure exactly what you're asking. If you want to control the detection behavior of a single service test, look at the command line options for the monitor scripts you're using. For example, fping.monitor takes the command line options that you can use to control the ping timeout behavior: -r num retry num times for each host before reporting failure -s num consider hosts which respond in over num msecs failures -t num wait num msecs before sending retries -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list mon@linux.kernel.org http://linux.kernel.org/mailman/listinfo/mon
Re: Alerts coming too often... why?
--On Thursday, November 03, 2005 12:29:55 -0500 Bill [EMAIL PROTECTED] wrote: I have it set to alert every 60minutes, but I get them about every 5 minutes. In reading the doc's I noticed the results have to be the same for each entry otherwise it resends. I noticed the fping entries in my log are different. Yes, the default behavior is that if the summary of the failure changes a new alert should be generated. If you're running the current Mon from CVS you can control that by saying 'alertevery 60m strict'. Alternatively you could figure out why your alert is generating inconsistent output. Based on this string from your syslog output, unidentified output from fping, I'm guessing your alert script isn't correctly processing all of the fping output. I believe you might need a newer version of the fping.monitor script. If the latest version from CVS doesn't help, send us the version information for your version of fping and we'll see if we can fix it. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias!
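Since a changing summary line triggers a fresh alert, a home-grown monitor can help by printing its failed hosts in a stable order. The sketch below shows that idea for a shell monitor; it is not how fping.monitor (a Perl script) is implemented, and the function name is invented.

```shell
#!/bin/sh
# stable_summary HOST...
# Prints the failed hosts sorted and de-duplicated on a single line, so
# the summary stays identical while the same set of hosts is down.
stable_summary() {
    printf '%s\n' "$@" | sort -u | paste -s -d ' ' -
}

stable_summary web2 web1 web2   # prints "web1 web2"
```

The monitor would print this as its first line of output whenever it exits nonzero.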
Re: Alerts coming too often... why?
--On Thursday, November 03, 2005 13:49:15 -0500 Bill [EMAIL PROTECTED] wrote: So is the cvs relatively stable? Mon is not mission critical stuff here, so I'd be more than happy to run that on a bunch of machines. Right now I am on 0.99.2 I was eyeing CVS the other day... debating it. CVS is definitely more stable than 0.99.2. 0.99.2 has some nasty bugs, including some crash and burn type bugs. I need to spend some time integrating some last bug fixes to CVS and then we're ready to call it a release. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias!
Re: Main Groups?
--On Thursday, October 27, 2005 6:57 AM +0200 Frank 'eXplasm' Isemann [EMAIL PROTECTED] wrote: and for example on s3 doesnt run a ftp server .. how can i exclude the ftp service from this special server? From the service definitions portion of the documentation: exclude_hosts host [host...] Any hosts listed after exclude_hosts will be excluded from the service check. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list mon@linux.kernel.org http://linux.kernel.org/mailman/listinfo/mon
Re: A few bugfixes missing from fantabulous mon...
--On Friday, October 21, 2005 16:52:04 -0400 Ed Ravin [EMAIL PROTECTED] wrote: Oh, but just ignore that last patch set, that's totally the wrong one. I'm surprised no one tweaked me on it - maybe no one ever reads my mail all the way down to the bottom? I read it all, but hadn't gotten around to looking at the patches yet. (I guess I was hoping Jim would. He was probably hoping I would... :) I'll try to look at them this weekend. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list mon@linux.kernel.org http://linux.kernel.org/mailman/listinfo/mon
Re: how does exclude_period work?
--On Tuesday, October 11, 2005 17:24:27 +0200 Sebastiaan Veldhuisen [EMAIL PROTECTED] wrote: Hi David, I just committed a new version of Mon to CVS with *UNTESTED* support for a global exclude_period. Download the latest from the sourceforge CVS repository and put 'exclude_period = wd {Mon} md {8-14} hr {17-23}' into your config file, next to the other global settings. (You'll also need the current version of Mon::Client, since there were some protocol changes between your version (0.99.2) and the 1.1 series.) That's great news! Thanks for the enhancement :) I'll test it in the next days, but unfortunately i'm not allowed to use my own compiled code in production machines. I'll have to wait until Suse updates its rpms. Sorry to hear your employer ties your hands like that. 0.99.2 has some serious problems, including some that can trigger a perl bug that results in a perl segfault. You're trying to put the exclude_period definition inside a period. Put it above the first period definition and it should work. (And in current Mon code this would generate a config file syntax error.) I put exclude_period above all other period definitions and now i get a syntax error and mon won't start. What do you mean with current? Do you mean CVS? Right now I'm using version 0.99.2. Does it mean it is not possible to use an exclude period with my mon version? I don't remember whether the 1.0pre* series has this fix, but in 1.1* an unrecognized option in the period section will result in an error. Can you post a snippet of your current config and the resulting error message? -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias!
Re: how does exclude_period work?
--On Tuesday, October 11, 2005 22:32:51 +0200 [EMAIL PROTECTED] wrote: Quoting David Nolan : Sorry to hear your employer ties your hands like that. 0.99.2 has some serious problems, including some that can trigger a perl bug that results in a perl segfault. Yeah I know. It's not that i'm not capable of it. The company chose to stick to an Enterprise Linux version so they can get support tickets from Suse on the software. Good news is, that this is my last month working for them :o) I'll compile CVS on my Debian Sarge machine and test your enhancement. I know about the problems with 0.99.2, but so far (I'm lucky I guess) I haven't had any problems with segfaulting.

The segfault bug is triggered by calling a text parsing function (from a standard perl module, Text::ParseWords) with particularly large input. The ways I've seen this triggered are parsing monitor output and parsing trap input. I'd bet money it could probably be triggered by a large client request, but I just fixed the problem by not using that routine any more.

cf error: unknown syntax [exclude_period wd {Mon} md {1-7} hr {17-22}], line 69

Oh shoot... Now that I go look at the code to find where that comes from I remember that 0.99.2 had a complete parsing bug on exclude_periods that prevented them from ever working. Basically this code:

    elsif ($var eq "exclude_period" && inPeriod (time, $args) == -1) {
        close (CFG);
        return "cf error: malformed exclude_period '$args' (the specified time period is not valid as per Time::Period::inPeriod), line $line_num";
    }

needs to become this code:

    elsif ($var eq "exclude_period") {
        if (inPeriod (time, $args) == -1) {
            close (CFG);
            return "cf error: malformed exclude_period '$args' (the specified time period is not valid as per Time::Period::inPeriod), line $line_num";
        }
    }

The previous code was always falling through to the else clause whenever the period was valid, which is what produces the unknown syntax error. Jim was talking with me recently about actually designating something a stable version...
This seems like one more big reason to stop calling 0.99.2 the stable version. How about it Jim? Call mon-1-1-0pre2 Mon 1.1 and cut a release? -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list mon@linux.kernel.org http://linux.kernel.org/mailman/listinfo/mon
Re: not able to send sms alert
--On Wednesday, October 05, 2005 2:39 PM +0530 ankush grover [EMAIL PROTECTED] wrote: hey friends, I am trying to configure sms alerts for my servers. But I am getting the errors calling alert sms.alert for apache2/HTTP (/usr/lib/mon/alert.d/sms.alert,my number) 192.168.1.68 Oct 5 14:28:42 linux mon[6664]: could not exec alert /usr/lib/mon/alert.d/sms.alert: No such file or directory Either the file /usr/lib/mon/alert.d/sms.alert doesn't exist, it's not executable, or the binary referenced in the first line (/usr/bin/perl) doesn't exist. If you try to run the alert by hand you should see the same error. However I suspect you have a bigger problem. I suspect you have not yet read the README for sms.alert and realized that sms.alert requires having gnokii installed, and a Nokia cell phone connected to your computer. You probably want to look at some other form of SMS transmission. We use snpp.alert to talk to a SNPP server that dials a modem and sends a message via a TAPS/IXO dialup message transmission interface. However many phone providers don't offer that service anymore, so we often use email to the various cellular providers' email/sms gateways. i.e. [EMAIL PROTECTED], etc. Personally I don't feel that relying on SMS messaging for mission critical notifications is a good idea. We *primarily* use text messaging with SkyTel pagers, since SkyTel actually provides a reliable messaging service. Every cellular provider that I've checked says their text messaging service is for 'entertainment purposes only'. i.e. not reliable for business purposes. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias!
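The three possible causes of "could not exec alert ... No such file or directory" listed above can be told apart with a short check. This is a hypothetical helper, not something shipped with mon, and it only inspects the script's first line for a "#!" interpreter.

```shell
#!/bin/sh
# diagnose_exec PATH
# Reports which of the three causes applies: the file is missing, it is
# not executable, or the interpreter named on its "#!" line is missing.
diagnose_exec() {
    script=$1
    if [ ! -e "$script" ]; then
        echo "missing: $script"; return 1
    fi
    if [ ! -x "$script" ]; then
        echo "not executable: $script"; return 1
    fi
    first_line=
    IFS= read -r first_line < "$script" || :
    case $first_line in
        '#!'*)
            interp=${first_line#'#!'}
            interp=${interp#"${interp%%[! ]*}"}   # drop leading spaces
            interp=${interp%% *}                  # drop interpreter arguments
            if [ ! -x "$interp" ]; then
                echo "missing interpreter: $interp"; return 1
            fi ;;
    esac
    echo "ok: $script"
}
```

Calling diagnose_exec /usr/lib/mon/alert.d/sms.alert as the user mon runs as should point at the same problem mon is hitting. A script saved with DOS line endings shows up here as a missing interpreter, which is another common source of this error.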
Re: [Solaris 9] log output to terminal
--On Monday, October 03, 2005 11:54:40 +0200 Alexandre Pashai [EMAIL PROTECTED] wrote: hi all, mon daemon outputs logs into logfile (normally). On Solaris 9, the log is sent to other terminals...that's annoying. what's the matter ?? thanks fro replies Mon is just using syslog. Either you have mon syslog'ing to a facility that gets re-broadcast everywhere, or you have a syslog.conf that is sending too much information to the user terminals. From the mon manual: syslog_facility = facility Specifies the syslog facility used for logging. daemon is the default. Check your syslog.conf to see how logs from daemon are configured. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list mon@linux.kernel.org http://linux.kernel.org/mailman/listinfo/mon
Re: failed logins monitor
--On Monday, September 26, 2005 13:58:01 +0200 Administrator Chat-Net [EMAIL PROTECTED] wrote: hi all, on the webpage of intrusion[1] i saw that they have a login_failure monitor. is that monitor still available or is there another who does replace it? thx for reply greetz [1] http://www.intrusion.com/knowledge/article.aspx?ID=611166 My impression from reading that site is that the monitor scripts referenced are proprietary scripts written by Intrusion Inc., provided as part of the SecureNet Sensor product they sell. I'd guess that their script wouldn't be useful outside of their box anyway, since it probably is looking at pre-collected data from their system. For a general purpose monitor script you'd probably want something that parses syslog output. There is a syslog.monitor included with mon that serves as a syslogd replacement, but I've never personally used it. (I didn't like the 'must replace syslogd' requirement..) I have a similar tool which watches the syslog log files and pattern matches on the output, generating mon traps as necessary. I could probably add it to the mon CVS area if anyone is interested in using it... -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias!
Re: mon configuration
--On Thursday, September 08, 2005 10:59 AM -0400 Allan Wind [EMAIL PROTECTED] wrote: On 2005-09-08T16:07:16+0200, Graf László wrote: I am using a shell script wich runs in the background. How should I configure mon to alert me if the process hangs up or fail in operations? If you mean process dying with hangs up then you could use the ps.monitor that is in contrib. If you mean stop working as expected or hangs then it is a unresolved problem in general, and you need to look into making decisions based on timeouts. For instance have your script touch a file then write a monitor to alert if you if that file is too old. Should you mean a signal perhaps you want to trap that? /Allan In addition to Allan's suggestions I would also suggest looking into Mon traps. If your long lived process runs a program that generates a Mon trap every X period of time (1 minute, 10 seconds, whatever...) then you could have a trap timeout in Mon to detect when it stopped generating traps. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list mon@linux.kernel.org http://linux.kernel.org/mailman/listinfo/mon
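Allan's "touch a file, then alert if it is too old" idea could look roughly like this as a shell monitor. It is a sketch under a few assumptions: the function names and heartbeat path are invented, and it relies on find's -mmin test, which GNU and BSD find provide but POSIX does not require.

```shell
#!/bin/sh
# heartbeat_is_stale FILE MAX_AGE_MINUTES
# Succeeds (exit 0) if FILE is missing or was last modified more than
# roughly MAX_AGE_MINUTES ago.
heartbeat_is_stale() {
    [ -e "$1" ] || return 0
    [ -n "$(find "$1" -prune -mmin "+$2")" ]
}

# check_heartbeat FILE MAX_AGE_MINUTES
# Monitor-style wrapper: prints one summary line and returns 1 on failure.
check_heartbeat() {
    if heartbeat_is_stale "$1" "$2"; then
        echo "heartbeat $1 older than $2 minutes"
        return 1
    fi
    return 0
}
```

The background script would run touch on the heartbeat file once per loop, and the monitor would call check_heartbeat with that path and exit with its status. The trap-timeout approach described above avoids the shared file but needs the script to send traps instead.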
Re: monitor for ssl services?
--On Saturday, August 20, 2005 2:41 PM +0200 Miolinux [EMAIL PROTECTED] wrote: Hi, i searched and googled quite, but i didn't find a monitor for monitoring ssl services (i needed mail server one's: smtps,imaps,pop3s) [SSL not TLS] and didn't want to create an ssl tunnel for each one of them, so i modified tcpch.monitor and merget it with some parte of an imap-ssl.monitor that i found. Now are some weeks that i run it and seems to work, but i bet i made some error since i'm not a perl programmer. However since may be interesting (and someone could take a look at it) i'll attach the code. Ps. if someone does know an alternative to this code don't hesitate to talk! ;) I guess I never got around to adding CMU's imap tests to the contrib area. I've done that now, at least in CVS. As soon as the public copy of the sourceforge CVS repository is updated you'll find a imap directory visible here http://cvs.sourceforge.net/viewcvs.py/mon/mon-contrib/monitors/ which will contain three tests, one for IMAP over SSL, one for IMAP with STARTTLS, and one for plain text password authentication over IMAP. (The PTP test has support for a new monitor-auth.cf file to specify username and password, but I haven't added the documentation for that to the Mon repository yet. I'll work on that. It also can take user/password on the commandline.) The IMAP over SSL test has support for alerting when an SSL certificate is expired, or about to expire. We run two services with this test on our servers, one without certification notification, and one with. The one with certificate notification enabled is configured never to page, it just sends mail 10 days before the cert is going to expire. I think I'll go look and see what other monitors I have now that I should export... I've probably got a dozen or so to add. (Plus the docs for the monitor-auth.cf syntax...) 
-David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list mon@linux.kernel.org http://linux.kernel.org/mailman/listinfo/mon
Re: mon writes to /var/log/messages
--On Wednesday, August 17, 2005 1:08 PM +0200 Grames Gernot [EMAIL PROTECTED] wrote: Hi, i found out that the mon writes a lot of messages to the var/log/messages file during monitoring. How can i stop this?? It fills out my harddisk! Thank you! By reading the documentation for mon and syslog, and picking a configuration which suits your needs. By default mon logs to the 'daemon' syslog facility, and logs various messages at the debug, info, notice, alert, err, and crit syslog levels. I suspect you're logging daemon.info and higher messages to your messages file. Either log only higher level messages, or change mon to log to a facility you don't output to disk. Alternatively, use any of the myriad systems available that perform logfile rotation, so you don't keep your syslogs forever. If you *really* want no syslog output at all, modify the code to add that feature as an option and send us a patch. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias!
Re: Problem with depedencies
--On Thursday, July 28, 2005 11:37 AM +0200 \rueh hänä\ [EMAIL PROTECTED] wrote: Is something wrong with my dependencies? Or is it not possible to make more than one service depending on another service? And, are dependencies over different hostgroups possible? I think the problem is that you're expecting more from dependencies than they provide. Assuming you have dependency behavior set to 'm', what will happen is that test X won't be run if test Y has already detected a failure. But if the failure occurs *between* when Y was last run and when X is run then X will detect the failure first. The answer to this is to have virtually every service have at least an 'alertafter 2' setting, so that two consecutive failures have to be detected, and have the higher-order tests have shorter test intervals. i.e. for my web servers I have mon configured to ping them every 30 seconds, check their load average via snmp every 45 seconds, test http every minute, and test https every five minutes. (And the router between my monitoring host and the web server is pinged every 15 seconds...) -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias!
Re: question concerning monitor scripts
--On Monday, July 18, 2005 11:21 AM +0200 \rueh hänä\ [EMAIL PROTECTED] wrote: Any hints to this ? Or is a bash-script based monitor possible, too? How would this work? Mon's monitor programs can be any executable format you choose. We have some that are actually compiled C code. If you're most familiar with shell scripts, write a shell script. It should behave the way the documentation says a monitoring program should behave. The relevant passages from the documentation are:

    Monitor processes are invoked with the arguments specified in the configuration file, appended by the hosts from the applicable host group.

    [A monitor] should return an exit status of 0 if it completed successfully (found no problems), or nonzero if a problem was detected.

    The first line of output from the monitor script has a special meaning: it is used as a brief summary of the exact failure which was detected, and is passed to the alert program. All remaining output is also passed to the alert program, but it has no required interpretation.

-David Nolan Network Software Designer Computing Services Carnegie Mellon University
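Putting the three quoted rules together, a shell monitor can be as small as the sketch below. The function name and the idea of passing the per-host check as a command are choices made for this illustration; the check itself (a ping, a port probe, whatever fits) is left to the reader.

```shell
#!/bin/sh
# run_monitor CHECK HOST...
# Runs "CHECK host" for every host. Following the conventions quoted
# above, it prints the failed hosts as the first (summary) line, then
# one detail line per failure, and returns nonzero if anything failed.
run_monitor() {
    check=$1; shift
    failed= detail=
    for host in "$@"; do
        if ! "$check" "$host" >/dev/null 2>&1; then
            failed="$failed $host"
            detail="$detail$host: $check failed
"
        fi
    done
    if [ -z "$failed" ]; then
        return 0
    fi
    echo "${failed# }"
    printf '%s' "$detail"
    return 1
}
```

A complete monitor would define its own check function, call run_monitor with that function and "$@" (mon appends the hosts to the configured arguments), and exit with the returned status.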
Re: mon.cf features like redistribute ?
--On Tuesday, July 05, 2005 3:16 PM +0200 Jacques Klein [EMAIL PROTECTED] wrote:

> Hello, I downloaded mon-1.1.0pre1.tar.gz, made some experiments with it and now I am looking for an up-to-date documentation of this version, essentially a good description of the mon.cf syntax and the maybe new feature redistribute.

The lack of documentation is why it's 1.1pre instead of 1.1. :) I'll try to work on the documentation this weekend. For starters, I'll add this section:

    redistribute alert [arg...]

    A service may have one redistribute option, which is a special form of an alert definition. This alert will be called on every service status update, even sequential success status updates. This can be used to integrate Mon with another monitoring system, or to link together multiple Mon servers via an alert script that generates Mon traps. See the ALERT PROGRAMS section above for a list of the parameters mon will pass automatically to alert programs.

-David
Re: Help !!!
--On Wednesday, July 06, 2005 10:04 AM +0800 D K [EMAIL PROTECTED] wrote:

> hi, I am a Chinese, my english is poor, so I hope you can understand this letter. Yesterday I use Mon to monitor a server, I hope Mon can alert via xmpp protocol, so I wrote a alert file, but mon seem not work. If I execute alert file, it can alert to my jabber. Now I hope mon can monitor a services is down, it can alert to my jabber. Please Help me! I wait your reply! Thanks!!! A Helper 2005.7.6

Your English isn't too bad, but your problem reporting skills definitely need some work. In order to be able to help you, we need to know in what way mon isn't working. What did you do, what behavior did you expect, and what behavior did mon show? Is mon detecting your failure and not calling the alert, or calling the alert but it fails to behave as desired? Or is mon not detecting the failure at all?

If your script works when you run it but not when Mon runs it, the most likely causes are:

- $PATH differences (i.e. your script is running some program that appears in your PATH but not in the PATH that mon provides to the alert script)
- privilege differences (i.e. you ran your test as root but Mon is running as nobody, or similar)

We need more information in order to provide any better guidance for solving your problem.

-David
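One way to rule out the PATH and privilege causes listed above is to have the alert script set its own PATH and complain loudly when a helper is missing. The following is only a sketch: the directory list is an assumption about a typical system, and `require_cmd` is a hypothetical helper, not something mon provides.

```shell
#!/bin/sh
# Do not rely on the PATH inherited from mon; set the one this alert needs.
# (Assumed directories; adjust for your system.)
PATH=/usr/local/bin:/usr/bin:/bin
export PATH

# require_cmd NAME...
# Reports every helper that cannot be found in PATH, together with the user
# the script is running as. Returns 0 if all were found, 1 otherwise.
require_cmd() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "alert: '$cmd' not found in PATH=$PATH (running as $(id -un))" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling something like `require_cmd perl sendxmpp || exit 1` at the top of the alert (the helper names here are placeholders) turns a silent failure under mon into a message that shows both the search path and the effective user.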
Re: wrapping long lines in mon.cf, how?
--On Tuesday, June 28, 2005 4:39 PM -0500 [EMAIL PROTECTED] wrote:

> Can long lines in mon.cf be gracefully wrapped, as:
>
>     hostgroup testing_server_network 172.16.0.1 172.16.0.2 172.16.0.3 \
>         172.16.0.4 172.16.0.5
>
> or does this mess things up? Is this possible in mon.cf:
>
>     hostgroup My Server Group
>     watch My Server Group
>     ...

From the man page:

    Lines are parsed as they are read. Long lines may be continued by ending them with a backslash (\). If a line is continued, then the backslash, the trailing whitespace after the backslash, and the leading whitespace of the following line are removed. The end result is assembled into a single line.

Also from the man page:

    Hostgroup entries begin with the keyword hostgroup, and are followed by a hostgroup tag and one or more hostnames or IP addresses, separated by whitespace. The hostgroup tag must be composed of alphanumeric characters, a dash (-), a period (.), or an underscore (_). Non-blank lines following the first hostgroup line are interpreted as more hostnames. The hostgroup definition ends with a blank line.

And:

    Watch entries begin with a line that starts with the keyword watch, followed by whitespace and a single word which normally refers to a pre-defined hostgroup. If the second word is not recognized as a hostgroup tag, a new hostgroup is created whose tag is that word, and that word is its only member.

-David
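To see what the continuation rule from the man page does to a wrapped line, here is a small illustration using awk from a shell function. It is not mon's parser, only a sketch of the stated rule: drop the backslash, any whitespace after it, and the leading whitespace of the next line. The function name `join_continued` is made up for this example.

```shell
#!/bin/sh
# join_continued: read config text on stdin, print it with continued lines joined.
join_continued() {
    awk '
        {
            line = $0
            # A continuation line loses its leading whitespace.
            if (cont) sub(/^[ \t]+/, "", line)
            if (line ~ /\\[ \t]*$/) {
                # Line ends with a backslash (plus optional whitespace): strip
                # both and keep accumulating.
                sub(/\\[ \t]*$/, "", line)
                buf = buf line
                cont = 1
            } else {
                print buf line
                buf = ""
                cont = 0
            }
        }
        END { if (cont) print buf }
    '
}
```

Piping a wrapped hostgroup entry through `join_continued` shows the single line that the rule describes, which is a quick way to check that a wrapped entry means what you expect.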
Re: libperl.so.1 failed
--On Thursday, June 09, 2005 4:22 PM -0400 Kishore Jalleda [EMAIL PROTECTED] wrote:

> Hi David, Thanks for the reply, actually perl mon, works fine but not ./mon doesn't ? Kishore

(Let's keep this on the mailing list, so others can follow along. Especially since I'm going on vacation tomorrow... :)

Sounds like you've got two perl installations on your machine, and the one that mon is using isn't the one that's in your PATH. Compare the output of 'which perl' and 'head -1 mon'.

-David
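The comparison suggested above can be scripted. This is a rough sketch; `script_interpreter` and `compare_perl` are names invented for the example.

```shell
#!/bin/sh
# script_interpreter FILE: print the interpreter path from FILE's "#!" line.
script_interpreter() {
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# compare_perl FILE: show the perl named in FILE next to the perl found in
# PATH, and return 0 only when the two paths are identical.
compare_perl() {
    in_script=$(script_interpreter "$1")
    in_path=$(command -v perl 2>/dev/null || echo "none")
    echo "script uses: $in_script"
    echo "PATH finds:  $in_path"
    [ "$in_script" = "$in_path" ]
}
```

Running `compare_perl ./mon` prints both paths side by side. Note that a script starting with `#!/usr/bin/env perl` will report `/usr/bin/env`, so a mismatch in that case does not by itself indicate a problem.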
Re: mon user???
--On Wednesday, June 08, 2005 9:41 AM +0200 Sylvain Clerc [EMAIL PROTECTED] wrote:

> Hello, I would know if a special user for Mon is created during the installation because Mon hasn't permissions to execute my alert script (start or stop Heartbeat) and I want to try using sudo for resolve my problem.

Mon runs as whatever user you chose to run it as. Some places run it as root, some places run it as another user. Try running 'ps waux | grep mon' to find what user it's running as.

-David
Re: libperl.so.1 failed
--On Wednesday, June 08, 2005 3:36 PM -0400 Kishore Jalleda [EMAIL PROTECTED] wrote:

> Hi, I am trying to install mon on Solaris8/Sparc, Perl version installed is 5.8.5, I also installed all the perl modules required for mon, when I try to run mon, or test any of the monitors, i get an error,
>
>     ld.so.1: mon: fatal: libperl.so.1: open failed: no such file or directory
>
> there is no libperl.so.1 on the system, am i missing something ...Please suggest

Do you have a working perl installation? The error you are seeing is a linker error that implies that your perl installation may be broken. Try running some other perl scripts.

-David
Re: no monitor found while trying to run []
--On Thursday, April 28, 2005 12:03 PM -0400 george young gry@ll.mit.edu wrote:

> I assume I've somehow specified a null-named monitor in the config file, but I can't find the problem. Could someone take a look?

Just guessing here, but try removing the blank line from the routers:fping service entry. Also, you can run mon with debugging enabled, via '-d', to get more status information which might help track the problem.

-David
Re: trap received but not acted upon.
--On Friday, April 08, 2005 5:47 PM -0700 Jim Trocki [EMAIL PROTECTED] wrote:

> you need to use a valid period definition, i.e. something that is meaningful to Time::Period, such as wd {Sun-Sat}. try this:

I don't think that's his problem. An empty period definition is valid; it always matches. Mon handles this correctly. The problem is here:

    opstatus = unknown,

If he's using Mon 0.99.2 (which he is, since the particular error message he reported doesn't exist in the current code), that will cause exactly this error. If he's using either the latest 1.0 or 1.1 pre-release, that will just be ignored completely, as the newer common process_event subroutine completely ignores this tag and only processes the return value.

Hans, I suggest you set this to either 'ok' or 'fail', depending on the trap you're processing. Or just upgrade to a newer mon and be happier. :)

-David
Re: Monitor output size limitation
--On Friday, April 08, 2005 6:33 PM -0700 Jim Trocki [EMAIL PROTECTED] wrote:

> On Fri, 8 Apr 2005, David Nolan wrote:
>
>> This is a known bug with some regexps in perl's Text::ParseWords that is tickled by large input from mon.
>
> well it's not really a bug, it's just that the default stack size is inadequate for regexps in that module. bump up the stack allocation with ulimit -s and you'll see the problem vanishes.

Ahh, that's what it was. Back when I was seeing this problem I just remember seeing reports that it was a 'regexp bug', and didn't bother to track it down. Still, you'd think perl could do a better job of detecting the approaching stack size limit and throwing an error in that case instead of segfaulting.

> but it's better to have fixed the glitch with changing the code than expecting that people run with a modified stack size :)

True. Even when Mon didn't segfault, the performance of those regexps on significant amounts of data was sometimes horrible. I had occasions where mon.cgi's parsing of the opstatus output was taking a minute or more. Changing the encoding so a simple split could be used instead made my mon interface load almost instantly.

-David
Re: trap received but not acted upon.
--On Saturday, April 09, 2005 6:54 AM -0700 Jim Trocki [EMAIL PROTECTED] wrote:

>> I don't think thats his problem. An empty period definition is valid, it matches always. Mon handles this correctly.
>
> oh. that's busted. i never realized that was the case, nor intended it to be so. just reviewed the pod page for Time::Period, and i see it does mention that a valid period string is whitespace, but it doesn't say what it means. from testing the code it does return true when you give it an empty period string. i'm inclined to make mon treat the empty string as an error, since its meaning is ambiguous according to the documentation, and on principle.

In the documentation for Time::Period, right after it says whitespace or the string 'none' are legal, it says:

    If the period is blank, then any time period is assumed because the time period has not been restricted. In that case, inPeriod returns 1. If the period is none, then no time period applies and inPeriod returns 0.

So this seems like documented behavior to me. Though 'none' doesn't really make sense to ever use. I suppose it would be useful as a way to temporarily disable a period without deleting all the contents.

But it seems useful to be able to have multiple named periods that always match by doing:

    period page_first:
        ...
    period page_second:
        ...
    period email_log:
        ...

Not that I'm doing that, since my periods are all programmatically generated. But if you're building a config file by hand, having to specify a period definition for something you want to always match seems silly.

-David
Re: building mon
--On Thursday, April 07, 2005 4:51 PM -0500 Armand Pirvu (yahoo) [EMAIL PROTECTED] wrote:

> Hi , I tried to build mon and there are a couple of things.

Mon is a perl script. You don't really *build* perl scripts so much as install and run them. Copy the mon program to the location of your choice, give it a config file and run it.

> 1. Do I need SNMP ?

For Mon itself, no. Depending on which monitor scripts you want to run, maybe. Which monitor plugins you run will determine what dependencies you have.

> 2. What about Period.pm ? What is that for, where should it be ?

You do need the Time::Period and Time::HiRes perl modules. Providing complete details of how to install these modules is really outside of the scope of the Mon documentation, but in most cases you can probably install them via CPAN. Try running 'perl -MCPAN -e shell'. You may be prompted to configure CPAN if you haven't used the CPAN module before. It's probably safe to just say 'no' at that prompt. When you get to the 'cpan' prompt, type 'install Time::Period', and when that completes type 'install Time::HiRes'. If that doesn't work for you, then you have a non-standard perl setup on your machine.

You probably also want Mon::Client, also available from CPAN, or for download from the mon download site.

-David
Re: Mon Periods
--On Thursday, March 31, 2005 7:52 AM -0800 Chad Sobotka [EMAIL PROTECTED] wrote:

> I have tested this out and sometimes it does work. I bring a service down, go to the web interface, and it reports Failed (No Alerts Sent). However, most of the time I get an alert. I have also tried setting the first period to just period: instead of period p1:.

At what time of day were you doing your tests? And how long did you leave the service down? I assume you believe it wasn't long enough for your 'alertafter 2' to be triggered. If you change the alerts in the two periods to go to different addresses, which one is firing?

-David
Re: problems with period RESTART: and RESTART_FAILED
--On Tuesday, March 29, 2005 11:25 AM +0200 Anquijix Schiptara [EMAIL PROTECTED] wrote:

> if i start heartbeat, all services get up without any problems. now i want to test mon, if it tries to restart the httpd-service, if i stop it. mon sees, that the service isnt running anymore, but it automatically calls the bring-ha-down.alert script in RESTART_FAILED period instead of the restart-httpd script in the RESTART period. if i comment out the RESTART_FAILED entries, it works with restarting the service. the funny thing is, this configuration worked the first time i used it, but not the next few times. and i got the examples from a linux-magazine, which should work.

You have two periods defined. Neither period has an alertafter entry, so *BOTH* alerts will be called when a failure occurs. Which one is run first is random chance (probably based on the random order from a hash table key lookup). If you want one to be called before the other, you should put alertafter definitions in both periods. I suggest something like:

    period ATTEMPT_RESTART_FIRST:
        alert httpd_restart.alert
        alertafter 2
        alertevery 30s
    period RESTART_FAILED:
        alert bring-ha-down.alert ...
        alertafter 1m
        alertevery 1m

You also might want an upalert entry in the second period that would bring the heartbeat service back up.

-David
Re: ORing hosts in a hostgroup (instead of ANDing) for a monitor
--On Friday, March 25, 2005 10:57 PM -0300 Raul Dias [EMAIL PROTECTED] wrote:

> Sorry if this is covered in the docs and I missed. Is it possible to have an monitor to OR the hosts in a hostgroup and if one SUCCEED the service is considered SUCCESS? An example for this is to have a hostgroup with a few internet hosts and fping them. If one of them succeeds then the internet conection is ok. Some may fail and the conection still be ok. However if all of them fails, then the internet conection is supposed to be considered down. Did I miss something? Is this possible?

Yes. It needs to be a feature of whatever monitor script you're using.

> Are there any monitor scripts that do this now?

Yes. If you pass '-a' to fping.monitor it will report failure only if all hosts fail to respond.

Another approach is to add a threshold argument to the monitor script that causes it to allow that many hosts to be down before signaling an error. I already have a modified version of fping.monitor that does essentially that, except it exits with different error codes depending upon the number of failures. For example, if I set the threshold to 1, then if more than one host fails to respond the script returns 255. I suppose I could just commit those changes back to the main mon version, since they're entirely optional additional features, but for now that version is available at:

https://bugzilla.andrew.cmu.edu/cgi-bin/cvsweb.cgi/src/netsage/mon/mon.d/fping.monitor?rev=1.9&content-type=text/x-cvsweb-markup

-David
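The threshold idea described above can be sketched in a few lines of shell. This is not the modified fping.monitor; `threshold_monitor` and `check_host` are hypothetical names, `ping` stands in for the real test, and the exit status is simply 1 rather than the graded codes that script uses.

```shell
#!/bin/sh
# Hypothetical per-host test; replace with fping or whatever you actually use.
check_host() {
    ping -c 1 "$1" >/dev/null 2>&1
}

# threshold_monitor THRESHOLD HOST...
# Tolerates up to THRESHOLD failing hosts. When more than that fail, prints
# the failed hosts as the summary line and returns 1; otherwise returns 0.
threshold_monitor() {
    threshold=$1
    shift
    failed=""
    count=0
    for host in "$@"; do
        if ! check_host "$host"; then
            failed="${failed:+$failed }$host"
            count=$((count + 1))
        fi
    done
    if [ "$count" -gt "$threshold" ]; then
        echo "$failed"
        return 1
    fi
    return 0
}
```

Setting the threshold to one less than the number of hosts gives the "fail only if all hosts fail" behavior asked about in the question.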
Re: BUG: alertevery filtering fails because empty summary fails match with previous empty summary
--On Thursday, March 24, 2005 9:29 PM -0800 Michael Vogt [EMAIL PROTECTED] wrote:

> I found that the cause is that one value was replaced by (NO SUMMARY) if white space and the other was not. Adding the line marked # FIX around line 600 seems to correct the problem.

This is already fixed in mon-1.1.0pre1, available from http://sourceforge.net/projects/mon/ or ftp://ftp.kernel.org/pub/software/admin/mon/devel/

-David
Re: Bug: mon.cf keyword error in period section not detected
--On Thursday, March 24, 2005 10:30 PM -0800 Michael Vogt [EMAIL PROTECTED] wrote:

> Not sure if this has been reported. It is not fixed in mon-1.0.0pre5.

Also already fixed in mon-1.1.0pre1.

-David
Re: Mon and non-failure logging
--On Wednesday, March 23, 2005 10:41 AM +0100 Greg [EMAIL PROTECTED] wrote:

> Hi list, I'm a new user of Mon and thanks to the doc and the easy configuration file syntax I now have a working monitoring system for failure detection. But now I want more :) Are there some couples of monitors/alerts for usual monitoring, i.e. when detected values are in the range of everything-is-ok but we want Mon to report the activity to a log file (maybe rrd database for conveniency). I know this is not the primary usage of Mon, or what I've understood of it, but it would be useful and seems pretty simple to develop. So before re-inventing the wheel for the 42th time I prefer asking (yes, I'm lazy).

There are a couple of ways you could approach this problem.

You could have your monitor script exit with different error codes, and do different alerting based on the error code. You can accomplish this with either 'alert exit=10-20 foo.alert', available in mon 0.99.2 and newer, which applies to a single alert within a period, or via 'alertexitrange 10-20', available in mon-1.1.0pre1, which applies to all alerts within a period.

Or you could use 'redistribute foo.alert', available in mon-1.1.0pre1, which causes the configured alert script to get called for every status update. It was designed to allow you to redistribute all status updates to remote systems (other mon hosts, or other monitoring systems). But you could instead make the alert script log values into an rrd or perform some other operation.

-David
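As a rough idea of what a logging script hooked up via 'redistribute' could look like, here is a shell sketch that appends one line per status update to a file. It assumes that mon exports MON_GROUP and MON_SERVICE to the script and supplies the summary as the first line of standard input; check the ALERT PROGRAMS section of the man page for your version before relying on either. The function name `log_status` and the log format are made up for this example.

```shell
#!/bin/sh
# log_status LOGFILE
# Appends "epoch-time group service summary" to LOGFILE. The summary is taken
# from the first line of stdin (assumed to be how mon passes it to alerts).
log_status() {
    logfile=$1
    summary=""
    IFS= read -r summary || true
    printf '%s %s %s %s\n' \
        "$(date +%s)" \
        "${MON_GROUP:-unknown}" \
        "${MON_SERVICE:-unknown}" \
        "$summary" >> "$logfile"
}

# As a standalone script, finish with something like:
#   log_status /var/log/mon-status.log    # hypothetical log location
```

The same skeleton could feed an rrd update instead of a flat file; only the final printf would change.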
Re: Bug: mon.cf keyword error in period section not detected
--On Friday, March 25, 2005 9:39 AM -0800 Michael Vogt [EMAIL PROTECTED] wrote:

> I'll have to take a look at that latest version. How should I have known about these problems prior to noticing them myself? I did not see them in the sourceforge bug tracking. Where else should I look before reporting problems? Thanks for all your contributions to mon,

Sorry, there wasn't really any way for you to know about this. These were probably bugs which I discovered much like you did, and fixed in my local Mon copy. I don't know offhand whether they were in the patches that I sent to Jim back before we started working together directly, but those patches never got integrated anyway.

When Jim agreed to collaborate on further Mon development we re-activated what had been essentially a dead sourceforge project, and I basically integrated all of my outstanding changes at once, carefully reading through all the changes at the time to make sure I wasn't breaking anything. I also went through and essentially cleared out the pending sourceforge bug queue at the time, because it hadn't been monitored or updated in a long time.

-David
Re: monitoring email capability for monitor alerts
--On Tuesday, March 22, 2005 10:01 AM -0500 Andrew Siegel [EMAIL PROTECTED] wrote:

> There are many things that can go wrong in the email delivery chain, making it undependable for alerts of an urgent nature. Better to use qpage.alert to send TAP/IXO text messages to a pager. Use a modem directly connected to your mon host, and connected to a direct copper phone line to further minimize things that can go wrong. We use this technique as well as email-based paging, and most of the time the modem-transmitted messages get to the pagers faster.

We do something very similar. We have a custom alert script which first attempts to contact SkyTel's SNPP server over the internet, and if unable to contact it falls back to the SNPP server (qpage) on the local machine. It turns out that you can do more interesting things via SkyTel's SNPP server directly than you can via TAP/IXO. In particular I can enable two-way messaging, to allow our coverage people to reply to an alert from the pager.

-David
Re: monitoring email capability for monitor alerts
--On Tuesday, March 22, 2005 8:27 AM -0800 Michael Vogt [EMAIL PROTECTED] wrote:

> What holes are there in the setup where I just use mail.alert and smsmtponitor from one monitor to the other?

The problem with monitoring email submission and reception is that you have no way to know if the mail got all the way to the final hop. You can spend as much time as you want setting up a spiffy environment to verify that mail to address A gets delivered, but that doesn't tell you anything about address B.

You might find that your cellular providers provide a way to verify text message delivery, if you're using their web message submission forms. But that's fragile, because they're likely to redesign their web pages on a whim and break any script that drives the web interaction. SkyTel's SNPP server provides delivery confirmation information if you use two-way pagers.

Have you considered a fallback approach? In our environment we always have two people on duty, and page the primary before the secondary. Maybe a similar approach with different alert mechanisms would make sense. One thing you could try is to alert via email first, but then via dialing the user's cell phone directly with a modem. Even if you don't put in a fancy text-to-speech system, if caller ID works your admins can know "Hey, Mon is calling, that must mean I missed an alert..." If you've got a better secondary alert mechanism, use that instead.

-David
Re: monitoring email capability for monitor alerts
--On Tuesday, March 22, 2005 11:37 AM -0500 Ed Ravin [EMAIL PROTECTED] wrote:

> I've got a newer version with the fallback paging that I'll release One of These Days Real Soon Now. Maybe around the same time David releases his two-way Skytel code :-) :-).

While the whole alert script isn't really designed for use outside of CMU, the relevant portion of the code is:

    eval {
        local $SIG{ALRM} = sub { die "Timeout during connection " };
        alarm $timeout;
        my $snpp = Net::SNPP->new($server,
                                  #Debug => 1,
                                 ) or die "Unable to connect";
        local $SIG{ALRM} = sub { die "Timeout during communication " };
        alarm $timeout*2;
        $snpp->_CALL('mon') || die "Failed in _CALL";
        $snpp->_HELP() || die "Failed in _HELP";
        my $help = $snpp->message;
        if (grep /RPLY/, $help) {
            if ($message =~ /ALERT/) {
                $snpp->_2WAY();
                $snpp->_RPLY('[EMAIL PROTECTED]');
                $snpp->_MCRE("ack Working on it");
                $snpp->_MCRE("ack On my way");
                $snpp->_MCRE("ack Will fix later");
                $snpp->_MCRE("ack Ignoring");
                $snpp->_MCRE("disable failing");
                $snpp->_MCRE("disable-service");
                $snpp->_MCRE("disable-group");
                $snpp->_MCRE("enable failing");
                $snpp->_MCRE("enable-service");
                $snpp->_MCRE("enable-group");
            }
        }
        $snpp->send(Pager => $pager, Message => $message);
        my $status = $snpp->status;
        if ($status != CMD_OK && $status != CMD_2WAYOK && $status != CMD_2WAYQUEUED) {
            die "Failed to send to $pager: " . $snpp->message;
        }
        $snpp->quit || die "Failed to quit";
        $success = $server;
        #print STDERR "$server: success!\n";
    };

And you need to add one line to Net::SNPP to add the non-standard RPLY command:

    sub _RPLY { shift->command("RPLY", @_)->response() == CMD_OK }

-David
Re: Upgrading to 1.1.0.
--On Monday, March 07, 2005 5:56 PM +0100 Marko Riedel [EMAIL PROTECTED] wrote:

> Hello there, we upgraded to 1.1.0. So far everything seems to be okay, but traps no longer work. We did not change the code at the machines that send traps, except to install the latest version (1) of Mon::Client. Now traps that used to work cause the following output:
>
>     trap trap 1 from grp=somegroup svc=DYNDNS, sta=255 failure for somegroup DYNDNS 1110213302 somehost DYNDNS OKA
>
> As you can see the trap includes the output from the remote host, which says that everything is okay. We did not change the return codes at all. How can a trap that used to work suddenly turn into a failure? Thank you for your help.

Marko, I'm trying to track this down to see if there is a bug. The output you included is the syslog message that's sent when a trap is received. The only problem I see in that message is that the source IP address of the trap isn't being filled in. Are there any other log messages? And can you provide a bit more detail on the exact failure behavior you see? I assume that mon is just ignoring the trap completely. Does it just ignore certain traps, or all of them?

-David
Re: How to succesfully control a virtual service if no real servers are defined?
--On Thursday, March 10, 2005 1:27 AM +0100 Sebastiaan Veldhuisen [EMAIL PROTECTED] wrote:

> I already got the scripts from Christopher de Marco (ipvs.alert and ipvs.monitor) which allows you to monitor if a virtual service has real servers defined an take action, but i don't understand how to incorporate them into mon.cf.

BIG CAVEAT: I've never used LVS myself, so I'm taking some guesses here... Test this in a lab environment before deploying to a real world environment...

In your current model I think you want to add a third watch, which might look something like this:

    watch webmail-lvs
        service http
            description virtual server for umail unsecure
            interval 30s
            monitor ipvs.monitor -P tcp -V x.x.x.18:80 ;;
            period wd {Sun-Sat}
                alert ipvs.alert -D -P tcp -V x.x.x.18:80
                alertafter 2

That's close to right... Note that there is no upalert defined here, because it would be nonsensical: trying to bring up the virtual server when you just tested and determined that it's up and running would be silly. The upalerts on the per-host tests will take care of creating the virtual server, if I'm reading ipvs.alert correctly...

> The problem is that i can't put both real http servers in the same host group, because i have to do different alert actions (read: delete the specific server from the lvs). I saw that Christopher had this same problem in an older thread (http://www.mail-archive.com/mon@linux.kernel.org/msg01427.html). I've contacted him, but he doesn't respond to my mails. Anybody has a clue on how to implement this?

Let's be precise here... you can't put both real servers in the same group without rewriting the alert script. I think doing so would probably make some sense. In particular you would want a modified version of ipvs.alert which took a port number as one option, and read the list of real servers to enable/disable from the summary line. Then you could group the hosts together in the way that makes the monitoring solution much more elegant, especially when you start moving beyond two servers in the pool.

-David Nolan
Re: Trouble with -f mon option (deamon mode)
--On Wednesday, February 02, 2005 10:29 AM -0800 Michael Vogt [EMAIL PROTECTED] wrote:

> This changes the current directory to /. I want to have files used by monitors be referenced relative to the base directory. It worked fine without using -f. The hostgroup members I use for some custom monitors are actually filenames. I don't want to have to prepend the base directory. What is the reason for the cd /? What do I lose by not using -f when running from inittab? Thanks for any help, Michael Vogt

I suggest storing those files in the MON state directory, and using the MON_STATEDIR environment variable that is passed to monitor scripts to find the files.

The primary reason that daemons change their working directory is to avoid having a running daemon keep its working directory in a network-mounted filesystem, or in a filesystem you might want to unmount.

-David
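A minimal sketch of the MON_STATEDIR suggestion: a monitor can resolve each filename it receives against the state directory instead of depending on the current working directory. The helper name `state_path` and the fallback to "." when the variable is unset are assumptions made for this example.

```shell
#!/bin/sh
# state_path NAME
# Prints NAME unchanged if it is already absolute; otherwise prints it
# prefixed with $MON_STATEDIR (falling back to "." when that is unset, which
# is only useful for testing by hand outside of mon).
state_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s/%s\n' "${MON_STATEDIR:-.}" "$1" ;;
    esac
}
```

Inside the monitor's host loop, `file=$(state_path "$host")` then yields the same path whether or not mon was started with -f.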
Re: Mon: Action upon success?
--On Friday, January 14, 2005 11:46 AM -0600 [EMAIL PROTECTED] wrote:

Does anyone here know if or how I can cause an action upon success in mon.cf? I'm working to have mon communicate with an in-house monitoring system. The in-house monitoring system has its own protocol and tools. We have everything working except that I need to send a heartbeat message when a test succeeds. For those who are curious, this heartbeat message tells the in-house monitor that this host and service is OK and when to expect the next heartbeat. If the next heartbeat does not come within 150% of the heartbeat interval, an alarm goes off. Thanks!

Try the current version of Mon from the sourceforge CVS repository. There is a new config option, 'redistribute', which is configured on a service (not inside a period) and runs an alert script on every status update. We use this for sending mon traps between two mon servers, but it could be used for your function just as easily. (Though I just realized I need to add documentation for this option. I thought I'd done that already...)

-David Nolan
Network Software Designer
Computing Services
Carnegie Mellon University
Re: Help with mon and process.monitor
--On Tuesday, January 11, 2005 2:07 PM +1100 Craig Reeson [EMAIL PROTECTED] wrote:

Now I am getting an SNMP timeout issue (using monshow.cgi). I have tried increasing the timeout in process.monitor but it has made no difference. However, if I just run 'process.monitor -c mycom 172.28.47.60' then it works!

I assume your test runs of process.monitor are on the same machine as your mon server. Are you logged in as the user that your mon server runs as? I.e. could it be something about your login environment that's allowing the script to work?

Can you post a snippet of your mon.cf, showing the group definition and the service definition?

Also, you might want to try running this monitor script to verify that SNMP transactions with the target host are working. https://bugzilla.andrew.cmu.edu/cgi-bin/cvsweb.cgi/~checkout~/src/netsage/mon/mon.d/host.monitor?rev=1.9 (That's the script we use to verify that the host is responding to snmp, and test the load average.)

-David Nolan
Network Software Designer
Computing Services
Carnegie Mellon University
Re: alerts functionality
--On Monday, November 22, 2004 1:45 PM -0500 Jim Trocki [EMAIL PROTECTED] wrote: so total alerts sent is 1+2+3...+10? is the latter correct? I've only tested it up to two hosts going down consecutively :) it's correct depending on how you configure mon. this is the default behavior, but you can change it. Also, it should be pointed out that this is entirely dependent on the behavior of the monitor script. If the script outputs a different summary, then Mon will alert again (unless configured not to). Most scripts output the list of failing hosts as the summary, but not all. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list [EMAIL PROTECTED] http://linux.kernel.org/mailman/listinfo/mon
Re: mon logging setup problem
--On Monday, November 15, 2004 10:54 AM -0700 Shea Frederick [EMAIL PROTECTED] wrote:

Fixed that, but still not creating a log file.

logdir = /var/log/mon
dtlogfile = dtlog

Ah, I think you also need:

dtlogging = 1

(Forgot about that setting...)

-David
David Nolan*[EMAIL PROTECTED]
Anyone going to LISA?
I was wondering if anyone else from the list is attending LISA '04 in Atlanta this week? I'll be down there Tuesday night through Friday. At last year's conference we had a very well attended Mon BOF. If there's enough interest I could arrange to have a BOF session again... -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list [EMAIL PROTECTED] http://linux.kernel.org/mailman/listinfo/mon
Re: Example config for snmp traps?
--On Thursday, November 11, 2004 6:52 AM + Aled Treharne [EMAIL PROTECTED] wrote:

I've been using mon for some time now, but I've recently found a need to have mon handle snmp traps generated by a new system. It's the end of a nightshift, so I may be missing something stupidly and glaringly obvious, but I can't see any information in the docs or example files as to how to set up a service to monitor snmp traps. Should I just do the same as for mon traps? Any help is most gratefully accepted.

Despite some misleading documentation, Mon currently has no native support for snmp traps. In order to integrate snmp traps into Mon, you need software which can receive the traps and generate Mon events. I'm attaching a message from this list from a while back where someone reports that they've been able to get the snmptrapd from the Net-SNMP package to integrate with Mon successfully. LooperNG looks like it might be useful for translating SNMP traps into mon events, but doesn't currently have native support for sending stuff to Mon. (They already provide a mon alert script to send events to LooperNG, but not vice versa.)

-David
David Nolan*[EMAIL PROTECTED]

---BeginMessage---
If you are using the net-snmp package, it's fairly straightforward to forward traps to mon with snmptrapd. Here is a script I use as the default traphandle for snmptrapd; it forwards selected traps to a hostgroup/service in mon. (See attached file: snmp2montrap)

From: TORRESANI, Roberto [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent by: [EMAIL PROTECTED]
Date: 01/08/02 04:32 AM
Subject: SNMP traps

Hi all, can mon receive snmp traps? As stated in the man page it seems that snmp support isn't implemented. Is that right? Is there a plan for when that will be available? Has any of you in the meantime created a patch for mon to enable snmp?

Roberto Torresani

(Attachment: snmp2montrap, binary data)
---End Message---
Re: Generating mon trap for use as heartbeat
--On Tuesday, November 09, 2004 5:32 PM -0800 Konstantin 'Kastus' Shchuka [EMAIL PROTECTED] wrote:

On Tue, Nov 09, 2004 at 01:52:32PM -0800, Michael Vogt wrote:

I am planning to monitor some application servers at a datacenter with a custom monitor plugin. I want to have another monitor running at a remote location to monitor the main monitor at the datacenter (and vice versa). It looks like I should use mon traps in heartbeat mode. How do I create the heartbeats?

Why can't you use mon.monitor? It does not require any heartbeat, it just does what you are asking for: monitor mon at the other location.

Both approaches are valid, and test different things. I currently use mon.monitor to test my multiple mon servers from each other, but mon.monitor only verifies that the remote mon process is processing client requests. Adding a heartbeat service where the monitor script sends a trap is actually something I hadn't thought of before. I like the idea. It would verify that your mon server processes are successfully queuing monitor processes. I've actually had a failure mode in my system at one point where everything looked fine except some percentage of my mon scripts hadn't been run in days. It turned out that my mon server was constantly throttling the number of running processes, due to its configuration, and was running *way* behind. This approach would probably have detected that problem.

Michael Vogt wrote:

OK. I found remote.alert, which sends a trap. So I could modify this, or maybe use it as is, associated with a failalways.monitor to trigger it. Still not sure if I'd be badly reinventing a wheel. Is there a clean, proven way already implemented? Same thing for the configuration stab. Is there a working example?

Using remote.alert as a base would work reasonably. Or if you're willing to wait a day or so, I think I'm going to try to implement this for my system, and I'd be happy to post the script I use and the resulting config blocks as well.

-David
David Nolan*[EMAIL PROTECTED]
Re: Configuration file check
--On Monday, November 08, 2004 11:26 AM +0100 Andrea Carpani [EMAIL PROTECTED] wrote: Is there a way to check the syntax of a mon.cf configuration file before starting mon? Something like perl -c file No, but you can ask a running mon to parse a new config file for errors. If you're using mon.cgi, it provides a 'Test Mon Config File' option, or you can just run moncmd, with the arguments 'test config'. This has always been good enough for my needs, since I always have mon running. But if you require the feature you describe, I could add it fairly easily. -David Nolan Network Software Designer Computing Services Carnegie Mellon University ___ mon mailing list [EMAIL PROTECTED] http://linux.kernel.org/mailman/listinfo/mon
Re: mon.cf in a sql database ?
--On Thursday, September 30, 2004 3:36 PM +0200 Brice Beauvillain [EMAIL PROTECTED] wrote: Hello all, Is it possible for mon to have the mon.cf file in a database ? Thanks in advance, There's no way to do that directly, but at CMU we wrote a system called NetSage which allows us to maintain the data in a database, and generate mon.cf files from that database. And one of these days I swear I'm going to have the time to write some documentation so I can get it released... Unfortunately I've been saying that for over a year. -David Nolan Network Software Designer Computing Services Carnegie Mellon University ___ mon mailing list [EMAIL PROTECTED] http://linux.kernel.org/mailman/listinfo/mon
Re: mon.cf in a sql database ?
--On Thursday, September 30, 2004 5:58 PM -0400 Ed Ravin [EMAIL PROTECTED] wrote: But you can do it indirectly. Use the esyscmd macro in m4: Ewww.. m4. Uh, I mean, ooh, thats kinda neat. :) I wonder how well m4/mon handles it when the esyscmd program takes a long time to return, or just fails. I suspect not well, at least during a config reload. NetSage generates the config files on its own, and then runs a script that copies it into place, asks Mon to test the file, and assuming it passes the test tells mon to reload. An m4 macro that dynamically generates the contents sounds like it would make the 'test and reload' operation both expensive, and unpredictable, since the file would be generated twice, and not guaranteed to be the same. -David Nolan Network Software Designer Computing Services Carnegie Mellon University ___ mon mailing list [EMAIL PROTECTED] http://linux.kernel.org/mailman/listinfo/mon
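The "generate once, copy into place, test, then reload" sequence described above might be sketched like this. It is not the NetSage script. The test and reload commands are passed in by the caller, so the sketch makes no claim about how moncmd reports a failed config test, and restoring the previous file on failure is an addition of mine rather than something the message describes.

```shell
#!/bin/sh
# deploy_config NEW LIVE TEST_CMD RELOAD_CMD
#   Copy the already-generated file NEW over LIVE, run TEST_CMD, and
#   run RELOAD_CMD only if the test succeeded. The file that is
#   tested is the same file that gets reloaded, which is the point
#   made against regenerating it inside an m4 macro.
#   If the test fails, the previous LIVE file is put back (when one
#   existed) and the function returns non-zero.
deploy_config() {
    new=$1 live=$2 test_cmd=$3 reload_cmd=$4
    backup="$live.prev"
    if [ -f "$live" ]; then cp -p "$live" "$backup" || return 1; fi
    cp "$new" "$live" || return 1
    if sh -c "$test_cmd"; then
        sh -c "$reload_cmd"
    else
        echo "config test failed, restoring previous file" >&2
        if [ -f "$backup" ]; then cp -p "$backup" "$live"; fi
        return 1
    fi
}
```

In practice the test command would presumably be built around `moncmd test config` and the reload command around whatever reload request your mon version accepts. Whether moncmd's exit status reflects the test result is an assumption to verify before relying on this.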
Re: Using Mon to modify DNS records
--On Monday, September 20, 2004 8:12 AM -0700 Nate Campi [EMAIL PROTECTED] wrote: If you're using BIND it's generally best to use nsupdate since you're not likely to introduce errors into the zone file this way. There also is a Perl module that can handle this, Net::DNS::Update, which is part of the Net::DNS package. We use it extensively to do thousands of DDNS updates a day to our zones. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list [EMAIL PROTECTED] http://linux.kernel.org/mailman/listinfo/mon
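For the nsupdate route, a mon alert could build its update script along these lines. This is only a sketch: the function name, server, zone, record names and key file path are placeholders, and authentication is left to whatever your BIND setup requires.

```shell
#!/bin/sh
# nsupdate_script SERVER ZONE NAME TTL ADDRESS
#   Print an nsupdate command script that replaces the A record for
#   NAME with ADDRESS. Nothing is sent; the caller pipes the output
#   into nsupdate. All values are chosen by the caller.
nsupdate_script() {
    server=$1 zone=$2 name=$3 ttl=$4 addr=$5
    cat <<EOF
server $server
zone $zone
update delete $name A
update add $name $ttl A $addr
send
EOF
}
```

A call such as `nsupdate_script ns1.example.com example.com www.example.com 60 192.0.2.10 | nsupdate -k /path/to/keyfile` would then apply the change without anyone editing the zone file by hand.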
Re: Why no saved state for acks?
--On Thursday, September 16, 2004 11:02 AM -0700 Augie [EMAIL PROTECTED] wrote:

Back in 2001 Ed Ravin said the following: http://www.mail-archive.com/[EMAIL PROTECTED]/msg00014.html

I'm thinking of coding a patch to mon to include the state of ACK'd services in the saved state. My problem is that if I ACK a service and my mon server gets rebooted for some reason, it will start paging people on alerts that were already acknowledged.

ACK state still does not seem to be kept in mon-1-0-0pre4, so my question is: is anyone working on this already?

Try using the development version from the sourceforge CVS repository. It has full scheduler state saving capability. I think there is still a small bug with saving ACKs, where sometimes it loses the ACK during a reload, but I'd say it currently works 99% of the time...
Re: DNS Monitor
--On Monday, September 13, 2004 10:26 AM -0300 Dalpi [EMAIL PROTECTED] wrote: Hello all, Has anyone faced the following error when using DNS monitor? Zone 'nova.net': failed servers: x.x.x.x Diagnostics: SOA query for nova.net from x.x.x.x failed question section incomplete I'm not being able to discover what is causing this failure. I've captured the packets, but it seems that there is no error in the query/answer. I don't recall seeing that before offhand, but I'm not sure. Run 'dig soa nova.net @x.x.x.x' and see if the result code/content makes sense. If you're not sure what it should look like, please either post the output or send it to me personally if you're concerned about posting the data. -David Nolan Network Software Designer Computing Services Carnegie Mellon University ___ mon mailing list [EMAIL PROTECTED] http://linux.kernel.org/mailman/listinfo/mon
Re: watch http on one server and several ports
--On Thursday, August 12, 2004 5:50 PM +0200 Antoine Reboul [EMAIL PROTECTED] wrote:

Hi, (sorry for my poor English, I'm French...) I have a high-availability solution (lvs / mon / heartbeat). My webservers host 2 web sites. WebsiteA: adressIP:80, WebsiteB: adressIP:8099. I want Mon to watch each port, so I wrote this:

--- part of mon.cf ---
watch RealServer
    service http
        interval 30s
        monitor http.monitor -p 80
        period wd {Sun-Sat}
            numalerts 1
            alert lvs.alert
            upalert lvs.alert
    service http
        interval 30s
        monitor http.monitor -p 8099
        period wd {Sun-Sat}
            numalerts 1
            alert lvs.alert
            upalert lvs.alert
---

Your services need to have unique names, so 'service http' and 'service http-8099' or similar. (And the fact that Mon doesn't notice this and complain is a bug.)

-David
David Nolan*[EMAIL PROTECTED]
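Since Mon does not complain about duplicate service names, a quick check can be run before loading a config. The sketch below is a rough line-based scan with awk, not a real mon.cf parser, and the function name is made up. It assumes `watch` and `service` each appear as the first word on their own line.

```shell
#!/bin/sh
# find_duplicate_services FILE
#   Print "watch-name service-name" once for every service name that
#   occurs more than once inside the same watch block. Comment lines
#   are skipped naturally because their first field is not "watch"
#   or "service".
find_duplicate_services() {
    awk '
        $1 == "watch"   { watch = $2 }
        $1 == "service" { if (seen[watch, $2]++ == 1) print watch, $2 }
    ' "$1"
}
```

Run against the config quoted above, it would report `RealServer http`; renaming the second service to `http-8099` makes the report go away.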
Re: shutdown heartbeat
--On Thursday, July 08, 2004 8:30 AM +0200 mixo [EMAIL PROTECTED] wrote:

How can I shut down heartbeat from an alert script? This does not seem to work:

# +
# !/bin/sh
/usr/lib/mon/alert.d/mail.alert $*
/etc/init.d/heartbeat stop
# +

The email is sent out, but heartbeat is not stopped.

Are you running Mon as a user that can shut down heartbeat? If you run the script by hand, does it shut down heartbeat? If so, figure out what is different between your environment and the one that Mon provides. Most likely PATH is set differently, and the heartbeat script isn't setting PATH or fully specifying the path to some program.

-David
David Nolan*[EMAIL PROTECTED]
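One way to test the PATH theory is to run the command with an almost empty environment and see whether it still works. The helper below is a debugging sketch. The PATH it sets is an assumption chosen to be sparse, not the value Mon actually passes to alerts.

```shell
#!/bin/sh
# run_sparse COMMAND [ARGS...]
#   Run COMMAND with every inherited environment variable removed
#   and only a short PATH set, to expose scripts that silently rely
#   on a login shell's environment.
run_sparse() {
    env -i PATH=/usr/bin:/bin "$@"
}
```

Running `run_sparse /etc/init.d/heartbeat stop` as the user Mon runs as should show whether the init script copes. If it fails there but works from a login shell, setting PATH explicitly at the top of the alert script is the likely fix.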
Re: INSTALL updates
--On Wednesday, July 07, 2004 11:26 AM -0700 Eric Sorenson [EMAIL PROTECTED] wrote: I got frustrated trying to show someone how to install mon, so I rewrote chunks of the INSTALL doc to match reality. Apply or ignore as you see fit. Excellent, thanks. Jim, I'll apply these. -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list [EMAIL PROTECTED] http://linux.kernel.org/mailman/listinfo/mon
Re: fping and root permissions
--On Monday, July 05, 2004 4:50 PM -0700 Joubin Moshrefzadeh [EMAIL PROTECTED] wrote:

If I run mon as a regular user and am using fping.monitor, I get the error about needing root permissions or running fping with setuid root. How do you do the setuid thing?

As root:

chmod +s /path/to/fping

I had the same problem before using perl's ping module and trying to do an icmp ping...

This wouldn't be solvable without using suidperl, which introduces a whole slew of other issues.

-David
David Nolan*[EMAIL PROTECTED]
Re: Why doesn't _trap_timer get reset?
--On Monday, June 28, 2004 3:36 PM -0400 Jim Trocki [EMAIL PROTECTED] wrote:

On Mon, 28 Jun 2004, David Nolan wrote:

While it doesn't add any bugs, I don't believe it fixes any either.

it does indeed fix the bug where a received trap would not reset the _trap_timer, preventing traptimeout from working at all. i've tested it and it works properly now.

Trap timeouts work already. I get them on occasion.

Careful reading of the code makes it clear that _trap_timer is only ever relevant after a timeout has already occurred. It prevents a timeout alert from happening on every pass through the code.

that is not true. _trap_timer is what counts down the timeout counter in the first place. it is what gauges whether or not a timeout has occurred. once a timeout happens, as indicated when _trap_timer drops to zero or below, do_alert is called and _trap_timer is then reset to the value of traptimeout, and it starts counting down again. what's supposed to prevent _trap_timer from hitting 0 in the first place is the reception of a trap, and that is what was broken, and the patch i posted fixes that.

Here's the code that actually decides whether or not to call handle_trap_timeout:

    if ($sref->{_trap_timer} <= 0 && $tm - $sref->{_last_trap} > $sref->{traptimeout}) {
        $sref->{_trap_timer} = $sref->{traptimeout};
        handle_trap_timeout ($group, $service);
    }

(This is from the CVS head version; the mon-1-0-pre* version uses _last_uptrap, not _last_trap. IIRC I decided that was a logic bug and fixed it in my code. But that's a different issue.)

Note the second half of the if clause. The if clause is confusing, so I'll re-order it and put some parens in:

    if ((($tm - $sref->{_last_trap}) > $sref->{traptimeout}) && ($sref->{_trap_timer} <= 0)) {
        $sref->{_trap_timer} = $sref->{traptimeout};
        handle_trap_timeout ($group, $service);
    }

So there are two clauses. One is testing whether we've received a trap within the traptimeout window. The other test is checking whether _trap_timer is set. And since the only code that ever resets _trap_timer is inside the if statement, the only reason it wouldn't be less than zero is if a trap timeout has fired recently, or we're in the "Mon just started recently" state.

Again, I believe the only bug here is that the code is confusing. (Either it's confusing both you and Tim, or it's confusing me. I believe it's you. :) Either _trap_timer should be made the only thing that controls timeouts (apply the patch to reset on each trap, and remove the second clause of the existing if statement) or it should be removed and replaced with the _last_traptimeout style code as I suggested earlier. Either way is acceptable to me.

-David
David Nolan*[EMAIL PROTECTED]
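To make the two-clause argument easier to check, here is the quoted condition restated as a small shell predicate. It is an illustration of the logic as described in this message, not code from mon, and the function name is hypothetical.

```shell
#!/bin/sh
# trap_timeout_due NOW LAST_TRAP TRAPTIMEOUT TRAP_TIMER
#   Exit status 0 when the quoted condition would call
#   handle_trap_timeout: the countdown timer has run out AND no trap
#   has arrived within the timeout window. All values are seconds.
trap_timeout_due() {
    now=$1 last_trap=$2 traptimeout=$3 trap_timer=$4
    [ "$trap_timer" -le 0 ] && [ $((now - last_trap)) -gt "$traptimeout" ]
}
```

For example, `trap_timeout_due 1000 100 300 0` succeeds (timer expired, last trap 900 seconds ago), while `trap_timeout_due 1000 900 300 0` fails: the timer has expired but a trap arrived 100 seconds ago, so the second clause alone suppresses the alert. That is the behaviour David describes when he says trap timeouts already work without resetting _trap_timer on each trap.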
Re: Why doesn't _trap_timer get reset?
--On Saturday, June 26, 2004 6:57 PM -0500 Tim Klein [EMAIL PROTECTED] wrote:

But what if the trap never times out? It appears that the value of _trap_timer just keeps getting decremented forever! (There's a different conditional that keeps alerts from being sent after it gets below zero.) I can't find anything in the code that could ever reset it. Am I misunderstanding the intended purpose of _trap_timer?

Tim,

Having just read this code, I'll agree that it's a bit confusing. But I don't believe this is a bug. Essentially _trap_timer is used entirely as a way to prevent trap timeout alarms from happening on every pass through the code after the timeout is reached. I.e. the actual check for the timeout is where it compares ($tm - $sref->{_last_trap}) to $sref->{traptimeout}. And then when a trap timeout actually occurs, _trap_timer is reset so that no more timeout alerts will be sent until that much time has passed again. Does that help?

-David Nolan
Network Software Designer
Computing Services
Carnegie Mellon University
Re: Problem with _trap_timer and long trap timeouts?
--On Sunday, June 27, 2004 1:45 PM -0500 Tim Klein [EMAIL PROTECTED] wrote:

Since we're on the topic of that _trap_timer thingie... Upon launch or reset of mon, each trap's _trap_timer is set to the value of its traptimeout. After that, _trap_timer keeps getting decremented as time progresses. This seems to make sense. But, as I read the code, it's impossible for an alert to be sent about a trap timeout unless _trap_timer has reached 0. So let's say I have a trap whose timeout is 1 month. I can't get alerted about this trap until at least 1 month has passed since the most recent launch or reset of 'mon', right? So does that mean the only way I'll ever know about a timeout of that trap is if I manage to go a month without relaunching or resetting mon?

Yes, that is true. But think about the reverse case, where you have received a trap within the last month, but Mon has restarted since then. Basically, unless Mon is remembering full opstatus information between restarts, the timer must be initialized to the full timeout value at startup. And the current Mon releases (both 0.99.2 and the 1.0-pre* versions) do not remember it.

Note that the code in the sourceforge CVS head contains support for saving and restoring the full opstatus. However, your question led me to look at that code and notice that _trap_timer isn't saved in the current code. I'll fix that and commit it shortly. (We don't use trap timeouts very much, so I'd never noticed before.)

-David
David Nolan*[EMAIL PROTECTED]
Re: Are traps processed while scheduler is stopped?
--On Wednesday, June 23, 2004 10:44 AM -0500 Tim Klein [EMAIL PROTECTED] wrote:

If I pause the scheduler by doing moncmd stop, presumably I won't get alerted about traps or trap timeouts. But will incoming traps still get noticed? That is, will last_trap still get updated as traps arrive, even though the scheduler is stopped?

Actually with Mon 0.99.* and the 1.0.pre* code you *WILL* get alerts if you get traps while the scheduler is stopped. This is one of the bugs that are fixed in my code, i.e. in the CVS HEAD on sourceforge, soon to be released as Mon 1.1.*. In that code alerts will not happen when traps are received while the scheduler is stopped, but the traps will be processed, and their information will show in the user interface. (But any non-trap services will obviously not be running, and stale data will sit around.)

-David
David Nolan*[EMAIL PROTECTED]
Re: Are traps processed while scheduler is stopped?
--On Wednesday, June 23, 2004 9:48 AM -0700 Jim Trocki [EMAIL PROTECTED] wrote:

ok well whether or not it's a bug may be debatable (traps aren't scheduled so stopping the scheduler shouldn't affect them), but it sounds like the behavior of your version may be more intuitive. maybe not. i don't know. there seem to be a number of nuances here, and i originally intended moncmd stop to control these first two:

related to scheduling loop:
- stopping monitors from being scheduled, thus stopping the possibility of alerts from them
- stopping trap timeout alerts

not related to the scheduling loop:
- stopping alerts from traps
- delaying/stopping the processing of inbound traps

thoughts about the other two, or maybe more?

My thought here is that 'moncmd stop' is the emergency stop button, i.e. "Something is horribly broken, mon is paging everyone about everything. STOP!" Thus all things directly monitored should stop being monitored, and absolutely no alerts should be generated for any reason. That's basically what I made it do.

-David
David Nolan*[EMAIL PROTECTED]
RE: mon 1.0.0pre2 and mon-client 1.0.0pre2 are in cvs on sourceforge
--On Monday, June 21, 2004 7:20 AM -0700 Jim Trocki [EMAIL PROTECTED] wrote: On Mon, 21 Jun 2004, Peter Wirdemo (MO/EMW) wrote: It must be a little odd, releasing a 1-0-0 version, for a software nearly 10 years old... not really odd at all--it's just the next release version. it could be named anything at all. would 7 be a better version number? How about Pi? On a more serious note, Jim and I are working pretty closely now on the new Mon code via sourceforge. All of my changes have been applied to the CVS repository, and once we've tested them a fair bit you can expect to see a mon-devel-1.1.X branch available. (Of course, I'm about to head to conferences for two weeks. Anyone on the list going to be at Usenix?) -David David Nolan*[EMAIL PROTECTED] curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias! ___ mon mailing list [EMAIL PROTECTED] http://linux.kernel.org/mailman/listinfo/mon
Re: david nolan's patches
--On Monday, June 07, 2004 1:32 PM -0700 Jim Trocki [EMAIL PROTECTED] wrote:

On Thu, 3 Jun 2004, David Nolan wrote:

(In fact, I may have posted it to the list, but I can't recall right now. Time for some email archeology.)

ahh, i apologize for my confusion. clearly my recollection was faulty, and you now corrected it. thanks. as far as maintaining the code in cvs with the intention of allowing better cooperation amongst ourselves, i think it's a good idea. i don't know if the sourceforge thing is what would be best. it does have some advantages, such as the bug tracking functionality, mon is already a registered project there and all (i haven't looked at that thing in forever), but cvs tends to aggravate me. i guess i've been living with it long enough to just accept it if that's all that sourceforge offers. i'd prefer giving subversion a try.

Since I use CVS every day for all the projects I work on, CVS would be fine with me. If you prefer another option, I'm sure we can work it out. Ultimately, I'll be continuing to maintain the CMU custom version in our CVS tree, and importing changes from your version. So I'll have to deal with two different repositories anyway. Sourceforge CVS would seem to be the easiest path, and as Scott points out it gives us some other features as well. If you don't have any strenuous objections, why don't we go ahead and start using sourceforge? Upload the current stable version and the devel version, give me access, and I'll work on integrating my changes. (My sourceforge userid is vitroth)

-David
David Nolan*[EMAIL PROTECTED]
Re: status of Mon development failures?
--On Thursday, June 03, 2004 10:30 AM -0700 Jim Trocki [EMAIL PROTECTED] wrote:

-rw-r--r-- 1 536 536 179923 Apr 23 13:37 mon-0.99.3-41.tar.gz

It might help if you made announcements about new dev versions being available. Have you started integrating any of the patches I've sent you yet? If not, are you going to do so anytime soon, or should I just give up?

Multiple subscribers to the mon mailing list have asked me for a copy of my patched version of Mon. I've given it to several and received nothing but positive feedback. I've been resisting the urge to package up and release CMU-Mon as a fork, but maybe I should. Part of the reason the NetSage mon configuration system hasn't been released yet is that it's really designed to take advantage of all the features I added.

-David
David Nolan*[EMAIL PROTECTED]
Re: david nolan's patches
--On Thursday, June 03, 2004 10:52 AM -0700 Jim Trocki [EMAIL PROTECTED] wrote:

this is a matter of historical record which should be public. rather than post his patched version to the mailing list for everyone to have a gander at and do something with if they chose, he sent them only to me (afaik), and since then i've been implicated as the reason why those patches haven't been distributed to anyone else. i don't think that's the right way to make progress, so i'm posting the diff between what he sent to me and the closest release to it at the time, which is 0.99.2.

If you're looking to have an accurate historical record, you should at least post the long description I sent you of the patch. As I recall, I itemized the entire patch, breaking it down into about 20 different changes, and for EVERY LINE in the patch I documented which changes it was a part of. I spent a couple of hours doing that, so that you could pick and choose which portions of the patch you wanted to apply. If you no longer have that information, I can dig it up. (In fact, I may have posted it to the list, but I can't recall right now. Time for some email archeology.)

By the way, Jim, I don't want you to feel like we're upset with you personally. But the problem is that last spring the issue of new mon releases came up, and we had several people interested in doing joint development of the system. But you spoke up and said you had some new versions for us to test, and you still wanted to be the primary maintainer. We all accepted that and trusted you to move the project forward. But it has become increasingly clear to most of us that you just don't have the time to do more than maintenance releases to Mon, and maybe not even that. Many of us are willing to volunteer our time to help Mon continue to evolve and become a better system. Please let us help you!
Re: status of Mon development failures?
--On Thursday, June 03, 2004 11:02 AM -0700 Jim Trocki [EMAIL PROTECTED] wrote: On Thu, 3 Jun 2004, Jim Trocki wrote: yeah, something's funny there. i saved the message by using the pipe raw text command in pine then ran uudeview on it, and that's what i got. in the raw message it has no --ikeVEW9yuYc//A+q to terminate the mime attachment. i'll have a look at this one you just sent and stick it into the latest. thanks. wtf, the one you just sent has the same problem. maybe it's an mua problem on your end? i had a look at the mail as delivered by the mta on my end and it doesn't have an ending --BXVAT5kNtrzKuDFl. maybe that line after the format STDOUT thing which begins with --- is messing things up somehow, since there's a blank line after that and nothing else. try gzipping the thing first then sending that as an attachment. It looks like the Mon mailing list is playing games with removing lines starting with dashes, and following lines. My last message had a signature that looked like the following, with a dash before my name on the first line of the sig, but the copy I got back from the list didn't have the dash. I bet the signature stripping is being overzealous and hitting attachments as well. David Nolan Network Software Developer Computing Services Carnegie Mellon University ___ mon mailing list [EMAIL PROTECTED] http://linux.kernel.org/mailman/listinfo/mon
Re: david nolan's patches
--On Thursday, June 03, 2004 2:25 PM -0400 Ed Ravin [EMAIL PROTECTED] wrote:

> On Thu, Jun 03, 2004 at 10:52:03AM -0700, Jim Trocki wrote:
>
>> this is a matter of historical record which should be public. rather than post his patched version to the mailing list for everyone to have a gander at and do something with if they chose, he sent them only to me
>
> Sounds like he wanted to respect your role as maintainer of Mon, and run major changes by you before releasing them to anyone else. The patches probably arrived at a moment when you didn't have time to look at them, allowing the misunderstanding and subsequent miscommunication to fester.

Bingo. In fact, here's a quote from a message I sent to mon-l last June:

"If anyone is interested in using my code, contact me and I'll point you to our CVS repository. (Note: I'm *not* interested in forking mon, but if more people are testing my code, maybe Jim will be willing to integrate it into the mainline more quickly.)"

I even got a request for access from Jim, and in the message I sent him I gave the URL for our CVS repository and said (among other things):

"I'm intending to fix these issues before sending you a patch. But, as I said, I'm waiting till you release something resembling my CVS version 2.0 (which is the version I assigned to the last patch I sent to you), and then I'll send you another patch, or patches. I'm not going to send this URL to the mon list. I don't want tons of people using this code, because I'm trying to discourage a mon fork."

> These kinds of problems would be less likely to happen if we were using Sourceforge or the like, since both the latest development version and submitted patches would be publicly visible to all.

Any publicly available CVS repository would be great. I'm not sure whether sourceforge is the best option, but ultimately I don't care as long as it works.

David Nolan
Network Software Developer
Computing Services
Carnegie Mellon University
Re: Frustrating hangs...
--On Tuesday, June 01, 2004 6:24 PM -0700 Ray Van Dolson [EMAIL PROTECTED] wrote:

> We're using WebMonkey as a front-end to mon (latest development version) and we're getting extremely slow performance as it queries the mon server.

Are any of your service tests outputting large amounts of text, i.e. a hundred lines or so? One of the modules that Mon uses (Text::ParseWords) is horribly inefficient, and basically becomes unusable with that much output. (Large numbers of hosts in one hostgroup might exhibit the same behavior, if I remember right.)

Unfortunately, the only solution is to eliminate the usage of that perl module. I've done that in my local version of Mon and the performance improvement was incredible. There are other reasons to eliminate that module anyway (it can cause perl to segfault). If you're interested in those patches, let me know.

-David

David Nolan*[EMAIL PROTECTED]
curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias!
Re: Downtime Log Bug?
--On Friday, April 23, 2004 10:49 AM +0200 Christian Hertel [EMAIL PROTECTED] wrote:

> Our mon always tolds us that his Downtime Logfile starts at 1.1.1970. Even if I erase the dt_logfile, the same error occurs after a few minutes.

There is a bug where blank lines from the dtlog are being output to the client, and the client is interpreting the timestamp as zero, which is 1.1.1970 (the Unix epoch). The fix is a single-line change. Search for this line in mon:

    sock_write ($fh, $_ ) if (!/^#/);

and replace it with:

    sock_write ($fh, $_ ) if (!/^#/ && !/^\s*$/);

(Yet another bug that I've had fixed in our copy of Mon for 1.5 years, but that I haven't submitted to Jim because he hasn't released the last set of patches I sent him.)

-David

David Nolan*[EMAIL PROTECTED]
curses: May you be forced to grep the termcap of an unclean yacc while a herd of rogue emacs fsck your troff and vgrind your pathalias!
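The intent of the changed condition is to skip comment lines and blank lines alike. It can be tried in isolation as below. This is a minimal sketch that assumes the two tests are joined with `&&`; the helper `dtlog_lines_to_send` and the sample records are hypothetical and do not reflect mon's real dtlog format or its `sock_write` routine.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the loop around sock_write: returns only the
# lines that would be sent to the client.
sub dtlog_lines_to_send {
    my @lines = @_;
    # Drop comment lines (leading "#") and lines that are empty or
    # contain only whitespace.
    return grep { !/^#/ && !/^\s*$/ } @lines;
}

# Made-up log content: a comment, two blank-ish lines, two records.
my @log = (
    "# downtime log\n",
    "\n",
    "sample record one\n",
    "   \n",
    "sample record two\n",
);

print for dtlog_lines_to_send(@log);
```

With only the original `!/^#/` test, the two blank lines would also be passed through, which is what leaves the client with an empty timestamp to interpret.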
Re: Segmentation fault when running under mod_perl
--On Friday, March 05, 2004 11:34 AM +0100 Stephane Bortzmeyer [EMAIL PROTECTED] wrote:

> I use Mon::Client and mod_perl to serve information from mon on the Web. At the command line, everything is fine, but when running under mod_perl (either Mason or Apache::Registry), I experience the infamous Segmentation fault when calling things like list_opstatus. Other mon commands work.

Are any of your monitor scripts returning particularly large summary/detail messages? Or are you running a large number of tests? There are some known bugs in Perl's regexp parser that Mon occasionally runs into. In particular, Mon 0.99.2 uses Text::ParseWords, which is both horribly slow and has ridiculously complex regexps that sometimes cause Perl to segfault, especially if the input data is large.

I've patched my copy of Mon and Mon::Client to use split in the cases that are most likely to cause a problem. If you're interested I can send you the patch. It unfortunately requires a small change in the Mon client protocol, but any program that uses Mon::Client should work fine. And the change for non-Mon::Client programs is probably 3 lines.

IMO, this is one of the big reasons why we *really* need a new stable release of Mon. I hope this is fixed in the development version, but I haven't personally tested it, as my Mon infrastructure is heavily dependent on the changes I've made to Mon, and Jim hasn't yet applied any of the patches I've sent him, as far as I know.

-David Nolan
Network Software Developer
Computing Services
Carnegie Mellon University
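The patch itself is not shown in this thread, so the following is only a sketch of the general trade-off it describes, not the actual change. It contrasts quote-aware parsing via `quotewords` from Text::ParseWords with a plain `split` on a fixed delimiter. The `key=value` record layout and the tab delimiter are assumptions made up for illustration and are not mon's real client protocol.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Quote-aware parsing: values may contain spaces if quoted, but every
# line is run through the module's regexp-based tokenizer.
sub parse_quoted {
    my ($line) = @_;
    return quotewords('\s+', 0, $line);
}

# Hypothetical alternative: fields are joined by a tab that is assumed
# never to occur inside a value, so a plain split is enough.
# The -1 limit keeps trailing empty fields.
sub parse_split {
    my ($line) = @_;
    return split /\t/, $line, -1;
}

my @from_quoted = parse_quoted(q{host=web1 summary="disk full"});
my @from_split  = parse_split("host=web1\tsummary=disk full");

print join('|', @from_quoted), "\n";
print join('|', @from_split),  "\n";
```

The split version only works if the sender and receiver agree on the delimiter, which may be why a change like this touches the wire protocol.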