Hi Michael,
many thanks for the quick reply.
I've just had a quick look through the source to see what the -s
flag actually does (I'll need to set up monitoring of heartbeat in
Nagios shortly, as it happens). It reads the PID file and then
checks if the process is running, and that the process with the PID
it's checking is actually heartbeat (by checking that its
/proc/.../exe is a link to the heartbeat binary).
Yes, of course, being able to read the source is one of the
advantages of open source, which I had forgotten.
But an strace of the open and kill syscalls (note the kill with SIG_0,
which apparently checks whether the process is still alive)
also reveals which files need to be readable.
# strace -e trace=open,kill /usr/lib64/heartbeat/heartbeat -s 2>&1 | grep -A3 \.pid
open(/var/run/heartbeat.pid, O_RDONLY) = 3
kill(31017, SIG_0) = 0
open(/usr/lib64/pils/plugins/InterfaceMgr/generic.so, O_RDONLY) = 3
open(/etc/ha.d/nodeinfo, O_RDONLY)= -1 ENOENT (No such file or directory)
heartbeat OK [pid 31017 et al] is running on nodeA [nodeA]...
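Put together, the sequence described above and visible in the trace
(read the PID file, probe the PID with signal 0, compare the
/proc/.../exe link) can be sketched in shell roughly as follows. This is
only an illustration, not heartbeat's actual code; the function name
check_pidfile and its messages are made up.

```shell
#!/bin/sh
# Hypothetical sketch of a PID-file check; NOT taken from the heartbeat source.
# Usage: check_pidfile PIDFILE EXPECTED_BINARY
check_pidfile() {
    pidfile=$1
    expected=$2

    [ -r "$pidfile" ] || { echo "cannot read $pidfile"; return 1; }

    # The PID may be padded with blanks, so strip spaces and newlines.
    pid=$(tr -d ' \n' < "$pidfile")
    case $pid in
        ''|*[!0-9]*) echo "no valid PID in $pidfile"; return 1 ;;
    esac

    # Signal 0 delivers nothing; it only runs the existence and
    # permission checks of kill(2).
    kill -0 "$pid" 2>/dev/null || { echo "cannot signal $pid"; return 1; }

    # Compare the exe symlink with the binary we expect to be running.
    exe=$(readlink "/proc/$pid/exe" 2>/dev/null) \
        || { echo "cannot read exe link of $pid"; return 1; }
    [ "$exe" = "$expected" ] \
        || { echo "pid $pid is $exe, not $expected"; return 1; }

    echo "pid $pid is $expected"
}
```

Called as root with /var/run/heartbeat.pid and the heartbeat binary this
should succeed for a live heartbeat; as an unprivileged user the kill -0
step is expected to fail for a root-owned process.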
On my system, even though the process directory and the symlinks
therein appear to be world-readable, they're not:
This seems to be similar on my system.
While the PID file is world-readable
# ls -l /var/run/heartbeat.pid
-rw-r--r-- 1 root root 11 Jun 27 10:36 /var/run/heartbeat.pid
some of the symlinks and other files in the process's procfs
subdirectory (restricted here to the first directory level for
brevity) aren't:
# tr -d \\040 < /var/run/heartbeat.pid | xargs -iPID find /proc/PID -maxdepth 1 -follow ! -perm -004 -ls
2032730121    0 dr-x------   2 root     root            0 Jul  1 08:13 /proc/31017/fd
2032730122    0 -r--------   1 root     root            0 Jul  2 07:09 /proc/31017/environ
2032730123    0 -r--------   1 root     root            0 Jul  2 07:09 /proc/31017/auxv
2032730117    0 -rw-------   1 root     root            0 Jul  2 07:09 /proc/31017/mem
     10247    1 drwx------   2 root     root         1024 Feb  6 13:18 /proc/31017/cwd
2032730130    0 -r--------   1 root     root            0 Jul  2 07:09 /proc/31017/mountstats
2032730132    0 -r--------   1 root     root            0 Jul  2 07:09 /proc/31017/smaps
But I don't believe that any missing permissions on /proc/31017 are
causing the problem here; rather, it is the kill() syscall seen in
strace's dump.
AFAIK, only root or the process owner may signal the process,
even though only the harmless SIG_0 is involved.
I think the developers deemed this way of checking the validity of a
possibly stale PID read from the PID file
much terser than fumbling with pstat() structures, or whatever the
Linux equivalent may be.
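As a small illustration of the signal 0 behaviour, here is a minimal
shell sketch that separates "process is gone" from "process exists but I
may not signal it". The helper name probe_pid and its three result
strings are made up, and the /proc lookup assumes a Linux procfs mounted
without the hidepid option.

```shell
#!/bin/sh
# Hypothetical helper, not part of heartbeat.
# Prints "alive", "alive-no-permission" or "dead" for the given PID.
probe_pid() {
    pid=$1
    # kill -0 sends no signal; it only performs the existence and
    # permission checks of kill(2).
    if kill -0 "$pid" 2>/dev/null; then
        echo alive
    elif [ -d "/proc/$pid" ]; then
        # The process exists, but we are not allowed to signal it.
        # Assumption: /proc is mounted without hidepid.
        echo alive-no-permission
    else
        echo dead
    fi
}

probe_pid "$$"    # the shell's own PID, which it may always signal
```

Run as an unprivileged user against the PID of a root-owned daemon, the
kill -0 step fails even though the process directory is still there,
which matches what the plugin sees via nrpe.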
2. Set up sudo or similar so Nagios can do the check
This is what I did; it was the most straightforward method,
especially since I had already set up a sudo ruleset so that the user
munin can run a few Munin plugins that require elevated privileges as well.
The only thing that puzzles me is that my check_heartbeat.sh plugin
worked with the former Heartbeat installation
without requiring any such workarounds.
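For reference, the sudo-based variant of such a check could look roughly
like the sketch below. It is not my actual check_heartbeat.sh; the
function names, the HB_CMD variable, the account name nrpe and the
sudoers line in the comment are assumptions for illustration. The only
things taken from this thread are the "heartbeat OK" prefix of the -s
output and the two plugin status lines; exit codes 0 and 2 are the usual
Nagios OK and CRITICAL codes.

```shell
#!/bin/sh
# Hypothetical re-creation of a check_heartbeat.sh-style plugin.
#
# Assumed sudoers entry (account name "nrpe" is a guess, adjust as needed):
#   nrpe ALL=(root) NOPASSWD: /usr/lib64/heartbeat/heartbeat -s
HB_CMD=${HB_CMD:-"sudo /usr/lib64/heartbeat/heartbeat -s"}

# Map the output of "heartbeat -s" to a Nagios status line and exit code.
# $1 = output of heartbeat -s, $2 = node name
classify() {
    case $1 in
        "heartbeat OK"*)
            echo "OK - heartbeat is running on $2"
            return 0 ;;
        *)
            # Anything else is treated as stopped.
            echo "CRITICAL - heartbeat is stopped on $2"
            return 2 ;;
    esac
}

# Run the status command (through sudo by default) and classify its output.
main() {
    out=$($HB_CMD 2>&1)
    classify "$out" "$(uname -n)"
}

# In a real plugin the script would end with:  main; exit $?
```

Keeping the parsing in classify separate from the privileged call makes
it possible to try the mapping by hand without touching sudo at all.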
Regards
Ralph
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] Behalf Of Michael Alger
Sent: Tuesday, July 01, 2008 5:56 PM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Strange HB Status displayed for root vs.
unprivileged users; bug or feature?
On Tue, Jul 01, 2008 at 04:04:54PM +0200,
[EMAIL PROTECTED] wrote:
After I had successfully upgraded this cluster to the new OS, I was
wondering why my Nagios plugin always returned CRITICAL states
even though heartbeat was running on the node at the time.
Then I discovered that the output of my check command differed
decisively depending on who executed the check.
e.g. as root I get
# /usr/lib64/nagios/plugins/custom/check_heartbeat.sh
OK - heartbeat is running on nodeA
or rather what really gets executed in that plugin and whose
output merely gets parsed is
# /usr/lib64/heartbeat/heartbeat -s
heartbeat OK [pid 31017 et al] is running on nodeA [nodeA]...
# pgrep -P1 -fl heartbeat
31017 heartbeat: master control process
But when run as an unprivileged user, as is the case when the nrpe
daemon is executing the check, oops, I get this strange result
# /usr/lib64/nagios/plugins/check_nrpe -n -H localhost -c check_heartbeat
CRITICAL - heartbeat is stopped on nodeA
How come, is this a bug or intended behavior?
I've just had a quick look through the source to see what the -s
flag actually does (I'll need to set up monitoring of heartbeat in
Nagios shortly, as it happens). It reads the PID file and then
checks if the process is running, and that the process with the PID
it's checking is actually heartbeat (by checking that its
/proc/.../exe is a link to the heartbeat binary).
On my system, even though the process directory and the symlinks
therein appear to be world-readable