Hi Michael,

Many thanks for the quick reply.

> 
> I've just had a quick look through the source to see what the -s
> flag actually does (I'll need to set up monitoring of heartbeat in
> Nagios shortly, as it happens). It reads the PID file and then
> checks if the process is running, and that the process with the PID
> it's checking is actually heartbeat (by checking that its
> /proc/.../exe is a link to the heartbeat binary).
> 

Yes, of course, one of the advantages of open source is that you can
always look at the code, which I had forgotten. But an strace of the
open and kill syscalls (note the SIG_0, which apparently is the check
whether the process is still alive) also reveals which files need to
be readable.

# strace -e trace=open,kill /usr/lib64/heartbeat/heartbeat -s 2>&1|grep -A3 \.pid
open("/var/run/heartbeat.pid", O_RDONLY) = 3
kill(31017, SIG_0)                      = 0
open("/usr/lib64/pils/plugins/InterfaceMgr/generic.so", O_RDONLY) = 3
open("/etc/ha.d/nodeinfo", O_RDONLY)    = -1 ENOENT (No such file or directory)
heartbeat OK [pid 31017 et al] is running on nodeA [nodeA]...
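As a rough sketch of the identity check Michael describes (read the PID
file, then compare /proc/PID/exe with the heartbeat binary), something
like the following could be used to reproduce it by hand. This is not
heartbeat's actual code; the function name exe_matches and its return
codes are made up for illustration.

```shell
#!/bin/sh
# exe_matches PIDFILE BINARY   (hypothetical helper, not part of heartbeat)
# Returns 0 if /proc/PID/exe points at BINARY, 1 if it points elsewhere,
# and 2 if the PID file or the exe symlink cannot be read.
exe_matches() {
    # The PID file may pad the number with blanks, so strip them.
    pid=$(tr -d ' \n' < "$1") || return 2
    # readlink fails if the process is gone or we lack permission.
    target=$(readlink "/proc/$pid/exe" 2>/dev/null) || return 2
    [ "$target" = "$2" ]
}
```

Run as an unprivileged user against root's heartbeat process, e.g.
exe_matches /var/run/heartbeat.pid /usr/lib64/heartbeat/heartbeat,
I would expect this to return 2, matching the "Permission denied"
errors from ls.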


> 
> On my system, even though the process directory and the symlinks
> therein appear to be world-readable, they're not:
> 

This seems to be similar on my system.

While the PID file is world-readable

# ls -l /var/run/heartbeat.pid 
-rw-r--r-- 1 root root 11 Jun 27 10:36 /var/run/heartbeat.pid

some of the symlinks and other files in the process's procfs "subdir"
(restricted here to the first subdirectory level for brevity) aren't:


# tr -d \\040 < /var/run/heartbeat.pid|xargs -iPID find /proc/PID -maxdepth 1 -follow ! -perm -004 -ls
2032730121    0 dr-x------   2 root     root            0 Jul  1 08:13 /proc/31017/fd
2032730122    0 -r--------   1 root     root            0 Jul  2 07:09 /proc/31017/environ
2032730123    0 -r--------   1 root     root            0 Jul  2 07:09 /proc/31017/auxv
2032730117    0 -rw-------   1 root     root            0 Jul  2 07:09 /proc/31017/mem
 10247    1 drwx------   2 root     root         1024 Feb  6 13:18 /proc/31017/cwd
2032730130    0 -r--------   1 root     root            0 Jul  2 07:09 /proc/31017/mountstats
2032730132    0 -r--------   1 root     root            0 Jul  2 07:09 /proc/31017/smaps



But I don't believe that missing permissions on /proc/31017 are causing
the problem here; rather, it is the kill() syscall seen in the strace
output. AFAIK, only root or the process owner may signal the process,
even when only the harmless SIG_0 is involved.
I think the developers deemed this way of validating a possibly stale
PID read from the PID file much terser than fumbling with pstat()
structures, or whatever the Linux syscall equivalent may be.
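To illustrate the point about signal 0, here is a minimal shell sketch
that separates the three cases an unprivileged user can run into. The
function name probe_pid is hypothetical, and using the existence of
/proc/PID to tell "not permitted" from "no such process" is my own
Linux-specific assumption; it is not what heartbeat does.

```shell
#!/bin/sh
# probe_pid PID   (hypothetical helper)
# Prints one of: alive, no-permission, gone
probe_pid() {
    pid=$1
    if kill -0 "$pid" 2>/dev/null; then
        # Signal 0 sends nothing; success means the process exists
        # and we are allowed to signal it.
        echo alive
    elif [ -d "/proc/$pid" ]; then
        # The process exists, but kill() was refused for this user.
        # Assumes /proc is mounted and shows other users' processes.
        echo no-permission
    else
        # No such process: a PID read from a PID file would be stale.
        echo gone
    fi
}

probe_pid "$$"    # the calling shell itself
```

Called with the heartbeat PID by a user who is neither root nor the
process owner, I would expect "no-permission" rather than "gone".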

> 2. Set up sudo or similar so Nagios can do the check

This is what I did, as it was the most straightforward method,
especially since I had already set up a sudo ruleset for the user munin
so that it can run a few Munin plugins which require elevated
privileges as well.
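For illustration, a sudo-based check could look roughly like the sketch
below. The user name "nagios" in the sudoers comment, the variable
HB_STATUS_CMD, and both function names are assumptions of mine, as is
treating every output without "is running" as CRITICAL; only the
heartbeat path and the -s flag come from this thread.

```shell
#!/bin/sh
# Hypothetical sudoers entry (the user name is an assumption):
#   nagios ALL=(root) NOPASSWD: /usr/lib64/heartbeat/heartbeat -s

# Command used to query heartbeat; sudo -n never prompts for a password.
HB_STATUS_CMD=${HB_STATUS_CMD:-"sudo -n /usr/lib64/heartbeat/heartbeat -s"}

# Map one line of "heartbeat -s" output to a Nagios message and
# return code (0 = OK, 2 = CRITICAL).
hb_to_nagios() {
    case $1 in
        *"is running"*) echo "OK - heartbeat is running";       return 0 ;;
        *)              echo "CRITICAL - heartbeat is stopped"; return 2 ;;
    esac
}

# Run the status command and translate its output.
check_heartbeat() {
    hb_to_nagios "$($HB_STATUS_CMD 2>&1)"
}
```

A real plugin would end with "check_heartbeat; exit $?" so that nrpe
sees the return code.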


The only thing that puzzles me is that my check_heartbeat.sh "plugin"
worked with the former Heartbeat installation without requiring any
such workarounds.

Regards
Ralph

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Michael Alger
> Sent: Tuesday, July 01, 2008 5:56 PM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Strange HB Status displayed for root vs.
> unprivileged users; bug or feature?
> 
> 
> On Tue, Jul 01, 2008 at 04:04:54PM +0200, [EMAIL PROTECTED] wrote:
> > After I had successfully upgraded this cluster to the new OS I was
> > wondering, why my Nagios plugin always returned CRITICAL states
> > though heartbeat was running on the node at the time.
> > Then I discovered that the output of my check command differed
> > decisively depending on who executed the check.
> > 
> > e.g. as root I get
> > 
> > # /usr/lib64/nagios/plugins/custom/check_heartbeat.sh
> > OK - heartbeat is running on nodeA
> > 
> > or rather what really gets executed in that plugin and whose
> > output merely gets parsed is
> > 
> > # /usr/lib64/heartbeat/heartbeat -s
> > heartbeat OK [pid 31017 et al] is running on nodeA [nodeA]...
> > 
> > # pgrep -P1 -fl heartbeat
> > 31017 heartbeat: master control process
> > 
> > But when run as an unprivileged user, as is the case when the nrpe
> > daemon is executing the check, oops, I get this strange result
> > 
> > # /usr/lib64/nagios/plugins/check_nrpe -n -H localhost -c check_heartbeat
> > CRITICAL - heartbeat is stopped on nodeA
> > 
> > How come, is this a bug or intended behavior?
> 
> I've just had a quick look through the source to see what the -s
> flag actually does (I'll need to set up monitoring of heartbeat in
> Nagios shortly, as it happens). It reads the PID file and then
> checks if the process is running, and that the process with the PID
> it's checking is actually heartbeat (by checking that its
> /proc/.../exe is a link to the heartbeat binary).
> 
> On my system, even though the process directory and the symlinks
> therein appear to be world-readable, they're not:
> 
> $ ls -la /proc/`sed 's/ *//' /var/run/heartbeat.pid`
> ls: cannot read symbolic link /proc/18467/cwd: Permission denied
> ls: cannot read symbolic link /proc/18467/root: Permission denied
> ls: cannot read symbolic link /proc/18467/exe: Permission denied
> 
> When heartbeat tries to ascertain that the process running with that
> particular pid is actually heartbeat, it encounters an error and
> therefore fails.
> 
> I'm not sure if this aspect of the proc filesystem's behaviour can
> be adjusted, or if it's desirable to adjust it. So, I would suggest
> one of:
> 
> 1. Go with your approach of just checking the process listing
> 2. Set up sudo or similar so Nagios can do the check
> 3. Set up a scheduled job to do a check as root, and write the result
>    status code and a line of output to a file somewhere. Then the
>    Nagios check command can check that the status file was
>    updated recently, and if so use that for its own response.
> 
> I'll probably go with option #2 or #3, but I haven't really looked
> into how exactly I'm going to ascertain that heartbeat is up and
> running. Possibly I'll use crm_mon -1 and check that the expected
> nodes are both online, and set a warning status if either is
> offline (and critical if I can't work out their status at all).
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
> 
