Re: [Linux-HA] Strange HB Status displayed for root vs. unprivileged users; bug or feature?

2008-07-02 Thread Michael Schwartzkopff
On Wednesday, 2 July 2008 07:51, [EMAIL PROTECTED] wrote:
(...) [Long discussion, shortened to save bandwidth]

Why don't you folks use plain SNMP? Heartbeat has a wonderful subagent!
SNMP is an Internet standard (RFC), implemented everywhere, and platform
independent, unlike your own Nagios installation.

Greetings,

-- 
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Address: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75

mail: [EMAIL PROTECTED]
web: www.multinet.de

Registered office: 85630 Grasbrunn
Commercial register: Amtsgericht München HRB 114375
Managing directors: Günter Jurgeneit, Hubert Martens

---

PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


RE: [Linux-HA] Strange HB Status displayed for root vs. unprivileged users; bug or feature?

2008-07-01 Thread Ralph.Grothe
Hi Michael,

many thanks for the quick reply.

 
 I've just had a quick look through the source to see what the -s
 flag actually does (I'll need to set up monitoring of heartbeat in
 Nagios shortly, as it happens). It reads the PID file and then
 checks if the process is running, and that the process with the PID
 it's checking is actually heartbeat (by checking that its
 /proc/.../exe is a link to the heartbeat binary).
 

Yes, of course, one of the advantages of open source is that you can
always look at the code, which I had forgotten.
Still, an strace of the open syscalls (and kill; note the SIG_0, which
appears to be the check whether the process is still alive)
reveals which files need to be readable.

# strace -e trace=open,kill /usr/lib64/heartbeat/heartbeat -s 2>&1 | grep -A3 \.pid
open("/var/run/heartbeat.pid", O_RDONLY) = 3
kill(31017, SIG_0)  = 0
open("/usr/lib64/pils/plugins/InterfaceMgr/generic.so", O_RDONLY) = 3
open("/etc/ha.d/nodeinfo", O_RDONLY)= -1 ENOENT (No such file or directory)
heartbeat OK [pid 31017 et al] is running on nodeA [nodeA]...


 
 On my system, even though the process directory and the symlinks
 therein appear to be world-readable, they're not:
 

This seems to be similar on my system.

While the PID file is world-readable

# ls -l /var/run/heartbeat.pid 
-rw-r--r-- 1 root root 11 Jun 27 10:36 /var/run/heartbeat.pid

some of the symlinks and other files in the process's procfs subdirectory
(restricted here to the first directory level for brevity) aren't:


# tr -d \\040 < /var/run/heartbeat.pid | xargs -iPID find /proc/PID -maxdepth 1 -follow ! -perm -004 -ls
2032730121    0 dr-x------   2 root root    0 Jul  1 08:13 /proc/31017/fd
2032730122    0 -r--------   1 root root    0 Jul  2 07:09 /proc/31017/environ
2032730123    0 -r--------   1 root root    0 Jul  2 07:09 /proc/31017/auxv
2032730117    0 -rw-------   1 root root    0 Jul  2 07:09 /proc/31017/mem
     10247    1 drwx------   2 root root 1024 Feb  6 13:18 /proc/31017/cwd
2032730130    0 -r--------   1 root root    0 Jul  2 07:09 /proc/31017/mountstats
2032730132    0 -r--------   1 root root    0 Jul  2 07:09 /proc/31017/smaps
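
Judging by the # prompt, the listing above was taken as root, so it does not
show what an unprivileged user can resolve. One way to test the
/proc/.../exe check mentioned earlier as the monitoring user might look like
the sketch below. The function name exe_of_pid is made up for illustration,
and the result depends on the kernel's procfs permission checks.

```shell
# Hypothetical helper: print the target of /proc/<pid>/exe, or "unreadable"
# if the link cannot be resolved (no such process, or insufficient permission).
exe_of_pid() {
    readlink "/proc/$1/exe" 2>/dev/null || echo "unreadable"
}
```

Running it once as root and once as the unprivileged user against the PID
from /var/run/heartbeat.pid should show whether the link is the part that
differs between the two accounts.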



But I don't believe that missing permissions on /proc/31017 are causing the
problem here; I suspect the kill() syscall seen in the strace dump instead.
AFAIK, only root or the process owner may signal the process,
even though only the harmless SIG_0 is involved.
I think the developers considered this way of validating a possibly stale
PID read from the PID file much terser than fumbling with pstat()
structures, or whatever the Linux equivalent may be.
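
For reference, the signal-0 probe discussed above can be sketched in shell as
follows. This is only an illustration, not heartbeat's actual code: the
function name pid_state is made up, and the /proc lookup is a Linux-specific
assumption used to tell "no such process" apart from "not permitted to
signal", two cases the shell's kill reports with the same failing exit status.

```shell
# Hypothetical helper, not part of heartbeat: classify a PID using signal 0.
pid_state() {
    pid=$1
    if kill -0 "$pid" 2>/dev/null; then
        # Signal 0 was accepted: the process exists and we may signal it.
        echo "alive"
    elif [ -d "/proc/$pid" ]; then
        # kill failed, but procfs still lists the PID: most likely EPERM.
        echo "alive-but-not-signalable"
    else
        # No procfs entry either: the PID is stale.
        echo "gone"
    fi
}
```

Called as an unprivileged user with the PID of a root-owned daemon, this
would be expected to print "alive-but-not-signalable" rather than "gone".
Whether heartbeat -s treats that case as "stopped" is not established in
this thread.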

 2. Set up sudo or similar so Nagios can do the check

This is what I did, as it was the most straightforward method,
especially since I had already set up a sudo rule set for that user, munin,
so that it can run a few Munin plugins which require elevated privileges as well.
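
A wrapper along the following lines is one way to combine the sudo approach
with a Nagios plugin. It is a sketch under assumptions, not the actual
check_heartbeat.sh: the function names are made up, the matching relies
solely on the "heartbeat OK" output shown in this thread, and it presumes a
sudoers rule that lets the monitoring account run heartbeat -s without a
password.

```shell
#!/bin/sh
# Sketch of a Nagios-style check; not the real check_heartbeat.sh.
# Assumed sudoers rule (<user> is whatever account runs the check):
#   <user> ALL = (root) NOPASSWD: /usr/lib64/heartbeat/heartbeat -s

HEARTBEAT=/usr/lib64/heartbeat/heartbeat   # path as used in this thread

# Map the text printed by "heartbeat -s" to a Nagios status line and
# return code (0 = OK, 2 = CRITICAL).
classify_status() {
    case "$1" in
        "heartbeat OK"*) echo "OK - heartbeat is running on $2";       return 0 ;;
        *)               echo "CRITICAL - heartbeat is stopped on $2"; return 2 ;;
    esac
}

# Run the status query through sudo (-n: fail instead of prompting).
check_heartbeat() {
    out=$(sudo -n "$HEARTBEAT" -s 2>&1)
    classify_status "$out" "$(uname -n)"
}
```

A plugin script built on this would end with something like
"check_heartbeat; exit $?" so that nrpe receives the return code.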


The only thing that still puzzles me is that my check_heartbeat.sh plugin
worked with the former Heartbeat installation
without requiring any workarounds.

Regards
Ralph

 -----Original Message-----
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] Behalf Of Michael Alger
 Sent: Tuesday, July 01, 2008 5:56 PM
 To: General Linux-HA mailing list
 Subject: Re: [Linux-HA] Strange HB Status displayed for root vs. unprivileged users; bug or feature?
 
 
 On Tue, Jul 01, 2008 at 04:04:54PM +0200, 
 [EMAIL PROTECTED] wrote:
  After I had successfully upgraded this cluster to the new OS, I
  wondered why my Nagios plugin always returned CRITICAL states even
  though heartbeat was running on the node at the time.
  Then I discovered that the output of my check command differed
  decisively depending on who executed the check.
  
  e.g. as root I get
  
  # /usr/lib64/nagios/plugins/custom/check_heartbeat.sh
  OK - heartbeat is running on nodeA
  
  or rather what really gets executed in that plugin and whose
  output merely gets parsed is
  
  # /usr/lib64/heartbeat/heartbeat -s
  heartbeat OK [pid 31017 et al] is running on nodeA [nodeA]...
  
  # pgrep -P1 -fl heartbeat
  31017 heartbeat: master control process
  
  But when run as an unprivileged user, as is the case when the nrpe
  daemon is executing the check, oops, I get this strange result
  
  # /usr/lib64/nagios/plugins/check_nrpe -n -H localhost -c check_heartbeat
  CRITICAL - heartbeat is stopped on nodeA
  
  How come, is this a bug or intended behavior?
 
 I've just had a quick look through the source to see what the -s
 flag actually does (I'll need to set up monitoring of heartbeat in
 Nagios shortly, as it happens). It reads the PID file and then
 checks if the process is running, and that the process with the PID
 it's checking is actually heartbeat (by checking that its
 /proc/.../exe is a link to the heartbeat binary).
 
 On my system, even though the process directory and the symlinks
 therein appear to be world-readable