Hi Michael, many thanks for the quick reply.
>
> I've just had a quick look through the source to see what the -s
> flag actually does (I'll need to set up monitoring of heartbeat in
> Nagios shortly, as it happens). It reads the PID file and then
> checks if the process is running, and that the process with the PID
> it's checking is actually heartbeat (by checking that its
> /proc/.../exe is a link to the heartbeat binary).
>

Yes, of course, that's one of the advantages of open source: you can
always look at the source, which I had forgotten. But an strace of the
open and kill syscalls (note the SIG_0, which appears to be the check
whether the process is still alive) reveals which files need to be
readable.

# strace -e trace=open,kill /usr/lib64/heartbeat/heartbeat -s 2>&1|grep -A3 \.pid
open("/var/run/heartbeat.pid", O_RDONLY) = 3
kill(31017, SIG_0) = 0
open("/usr/lib64/pils/plugins/InterfaceMgr/generic.so", O_RDONLY) = 3
open("/etc/ha.d/nodeinfo", O_RDONLY) = -1 ENOENT (No such file or directory)
heartbeat OK [pid 31017 et al] is running on nodeA [nodeA]...

>
> On my system, even though the process directory and the symlinks
> therein appear to be world-readable, they're not:
>

This seems to be similar on my system.
While the PID file is world-readable

# ls -l /var/run/heartbeat.pid
-rw-r--r-- 1 root root 11 Jun 27 10:36 /var/run/heartbeat.pid

some of the symlinks and other files in the process's procfs "subdir"
(here restricted to the first subdir level for brevity) aren't

# tr -d \\040 < /var/run/heartbeat.pid|xargs -iPID find /proc/PID -maxdepth 1 -follow ! -perm -004 -ls
2032730121    0 dr-x------   2 root root    0 Jul  1 08:13 /proc/31017/fd
2032730122    0 -r--------   1 root root    0 Jul  2 07:09 /proc/31017/environ
2032730123    0 -r--------   1 root root    0 Jul  2 07:09 /proc/31017/auxv
2032730117    0 -rw-------   1 root root    0 Jul  2 07:09 /proc/31017/mem
     10247    1 drwx------   2 root root 1024 Feb  6 13:18 /proc/31017/cwd
2032730130    0 -r--------   1 root root    0 Jul  2 07:09 /proc/31017/mountstats
2032730132    0 -r--------   1 root root    0 Jul  2 07:09 /proc/31017/smaps

But I don't believe that any missing rights on /proc/31017 are causing
the problem here; I think it is the kill() syscall, as seen in strace's
dump. AFAIK, only root or the process owner may signal the process,
even though only the harmless SIG_0 is involved.
I think the developers deemed this way of checking the validity of a
possibly stale PID read from the PID file much terser than fumbling
with pstat() structures, or whatever the Linux equivalent may be.

> 2. Set up sudo or similar so Nagios can do the check

This is what I did, as it was the most straightforward method,
especially since I had already set up a sudo ruleset for the user munin
so that it can run a few Munin plugins which require elevated
privileges as well.

The only thing that puzzles me is that my check_heartbeat.sh "plugin"
worked with the former Heartbeat installation without requiring any
quirks.

Regards
Ralph

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Michael Alger
> Sent: Tuesday, July 01, 2008 5:56 PM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Strange HB Status displayed for root vs.
> unprivileged users; bug or feature?
>
>
> On Tue, Jul 01, 2008 at 04:04:54PM +0200,
> [EMAIL PROTECTED] wrote:
> > After I had successfully upgraded this cluster to the new OS I was
> > wondering, why my Nagios plugin always returned CRITICAL states
> > though heartbeat was running on the node at the time.
> > Then I discovered that the output of my check command differed
> > decisively depending on who executed the check.
> >
> > e.g. as root I get
> >
> > # /usr/lib64/nagios/plugins/custom/check_heartbeat.sh
> > OK - heartbeat is running on nodeA
> >
> > or rather what really gets executed in that plugin and whose
> > output merely gets parsed is
> >
> > # /usr/lib64/heartbeat/heartbeat -s
> > heartbeat OK [pid 31017 et al] is running on nodeA [nodeA]...
> >
> > # pgrep -P1 -fl heartbeat
> > 31017 heartbeat: master control process
> >
> > But when run as an unprivileged user, as is the case when the nrpe
> > daemon is executing the check, oops, I get this strange result
> >
> > # /usr/lib64/nagios/plugins/check_nrpe -n -H localhost -c check_heartbeat
> > CRITICAL - heartbeat is stopped on nodeA
> >
> > How come, is this a bug or intended behavior?
>
> I've just had a quick look through the source to see what the -s
> flag actually does (I'll need to set up monitoring of heartbeat in
> Nagios shortly, as it happens). It reads the PID file and then
> checks if the process is running, and that the process with the PID
> it's checking is actually heartbeat (by checking that its
> /proc/.../exe is a link to the heartbeat binary).
>
> On my system, even though the process directory and the symlinks
> therein appear to be world-readable, they're not:
>
> $ ls -la /proc/`sed 's/ *//' /var/run/heartbeat.pid`
> ls: cannot read symbolic link /proc/18467/cwd: Permission denied
> ls: cannot read symbolic link /proc/18467/root: Permission denied
> ls: cannot read symbolic link /proc/18467/exe: Permission denied
>
> When heartbeat tries to ascertain that the process running with that
> particular pid is actually heartbeat, it encounters an error and
> therefore fails.
>
> I'm not sure if this aspect of the proc filesystem's behaviour can
> be adjusted, or if it's desirable to adjust it. So, I would suggest
> one of:
>
> 1. Go with your approach of just checking the process listing
> 2. Set up sudo or similar so Nagios can do the check
> 3. Set up a scheduled job to do a check as root, and write the result
>    status code and a line of output to a file somewhere. Then the
>    Nagios check command can check that the status file was
>    updated recently, and if so use that for its own response.
>
> I'll probably go with option #2 or #3, but I haven't really looked
> into how exactly I'm going to ascertain that heartbeat is up and
> running. Possibly I'll use crm_mon -1 and check that the expected
> nodes are both online, and set a warning status if either is
> offline (and critical if I can't work out their status at all).
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
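
P.S. A minimal shell sketch of the signal-0 check discussed above, in
case it helps when writing a plugin. It assumes Linux with procfs
mounted on /proc; pid_state is a made-up helper name, not something
heartbeat ships, and heartbeat itself may well perform the check
differently. "kill -0" delivers no signal and only reports whether
signalling would be allowed, so for an unprivileged caller a failure
can mean either "no such process" or "not permitted". The sketch tells
the two apart by looking for the /proc entry.

```shell
#!/bin/sh
# pid_state PID   (hypothetical helper, not part of heartbeat)
# Prints one of:
#   alive          signal 0 was accepted, so the process exists
#   no-permission  the process exists but we may not signal it
#   gone           no such process, i.e. the PID is stale
#   invalid        the argument is not a plain decimal PID
pid_state() {
    pid=$1
    # Reject empty or non-numeric input before touching kill or /proc.
    case "$pid" in
        ''|*[!0-9]*) echo invalid; return 2 ;;
    esac
    if kill -0 "$pid" 2>/dev/null; then
        echo alive
    elif [ -d "/proc/$pid" ]; then
        # Assumption: Linux procfs is mounted on /proc.
        echo no-permission
    else
        echo gone
    fi
}
```

Called as pid_state "$(tr -d ' ' < /var/run/heartbeat.pid)" by an
unprivileged user, this would be expected to report no-permission for a
root-owned heartbeat instead of treating it as stopped, unless /proc is
mounted with a restrictive hidepid option.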