[ClusterLabs] nfsserver_monitor() doesn't detect nfsd process is lost.
Hello. I have encountered a problem with the nfsserver RA on RHEL 7.1 with systemd. When the nfsd processes are lost due to an unexpected failure, nfsserver_monitor() does not detect it and no failover is executed. I am using the RA below (but the problem may also exist in the latest nfsserver RA):

https://github.com/ClusterLabs/resource-agents/blob/v3.9.6/heartbeat/nfsserver

The cause is the following:

1. After running "pkill -9 nfsd", "systemctl status nfs-server.service" still returns 0.
2. nfsserver_monitor() judges the resource state only by the return value of "systemctl status nfs-server.service".

--
# ps ax | grep nfsd
25193 ?        S<     0:00 [nfsd4]
25194 ?        S<     0:00 [nfsd4_callbacks]
25197 ?        S      0:00 [nfsd]
25198 ?        S      0:00 [nfsd]
25199 ?        S      0:00 [nfsd]
25200 ?        S      0:00 [nfsd]
25201 ?        S      0:00 [nfsd]
25202 ?        S      0:00 [nfsd]
25203 ?        S      0:00 [nfsd]
25204 ?        S      0:00 [nfsd]
25238 pts/0    S+     0:00 grep --color=auto nfsd
#
# pkill -9 nfsd
#
# systemctl status nfs-server.service
● nfs-server.service - NFS server and services
   Loaded: loaded (/etc/systemd/system/nfs-server.service; disabled; vendor preset: disabled)
   Active: active (exited) since Thu 2016-01-14 11:35:39 JST; 1min 3s ago
  Process: 25184 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS (code=exited, status=0/SUCCESS)
  Process: 25182 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=0/SUCCESS)
 Main PID: 25184 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/nfs-server.service
(snip)
#
# echo $?
0
#
# ps ax | grep nfsd
25256 pts/0    S+     0:00 grep --color=auto nfsd
--

This is because nfsd runs as kernel threads, and systemd does not monitor the state of running kernel threads. Is there a good way to handle this? (When I use "pidof" instead of "systemctl status", the failover succeeds.)

Regards,
Yuta Takeshita
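Since the poster reports that "pidof" works where "systemctl status" does not, here is a minimal sketch of the kind of additional check nfsserver_monitor() could perform. The function name and logging are illustrative, not part of the shipped RA; it assumes the RA's usual OCF shell helpers (ocf_log, $OCF_SUCCESS, $OCF_NOT_RUNNING) are available:

nfsserver_check_nfsd_threads()
{
        # "systemctl status nfs-server.service" only reflects the exited
        # rpc.nfsd helper, so also verify that the nfsd kernel threads
        # actually exist.
        if ! pidof nfsd > /dev/null 2>&1; then
                ocf_log err "nfsd kernel threads are not running"
                return $OCF_NOT_RUNNING
        fi
        return $OCF_SUCCESS
}

A check along these lines would still have to be called from nfsserver_monitor() in addition to the existing systemd status check.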
[ClusterLabs] Fwd: Parallel adding of resources
Hi,

I am running a 2-node cluster with this configuration on CentOS 6.6:

 Master/Slave Set: foo-master [foo]
     Masters: [ kamet ]
     Slaves: [ orana ]
 fence-uc-orana (stonith:fence_ilo4): Started kamet
 fence-uc-kamet (stonith:fence_ilo4): Started orana
 C-3 (ocf::pw:IPaddr): Started kamet
 C-FLT (ocf::pw:IPaddr): Started kamet
 C-FLT2 (ocf::pw:IPaddr): Started kamet
 E-3 (ocf::pw:IPaddr): Started kamet
 MGMT-FLT (ocf::pw:IPaddr): Started kamet
 M-FLT (ocf::pw:IPaddr): Started kamet
 M-FLT2 (ocf::pw:IPaddr): Started kamet
 S-FLT (ocf::pw:IPaddr): Started kamet
 S-FLT2 (ocf::pw:IPaddr): Started kamet

Here foo is a multi-state resource running in master/slave mode, and the IPaddr RA is just a modified IPaddr2 RA. I also have a colocation constraint so that the floating IP addresses are colocated with the master (see the sketch at the end of this message). Fencing is configured as well, and when I pull out the redundancy interface, fencing is triggered correctly. However, once the fenced node (kamet) rejoins, all my floating IP resources are deleted and the cluster ends up in the state below. If I log into kamet, the floating IP addresses are actually still present on it.

 Master/Slave Set: foo-master [foo]
     Masters: [ orana ]
     Slaves: [ kamet ]
 fence-uc-orana (stonith:fence_ilo4): Started orana
 fence-uc-kamet (stonith:fence_ilo4): Started orana

This is the CIB state after the fencing of kamet. I am attaching the full corosync.log from orana; the interesting parts of the log are:

Jan 13 19:32:44 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 13 19:32:44 corosync [QUORUM] Members[2]: 1 2
Jan 13 19:32:44 corosync [QUORUM] Members[2]: 1 2
Jan 13 19:32:44 [4296] orana crmd: info: cman_event_callback: Membership 7044: quorum retained
Jan 13 19:32:44 [4296] orana crmd: notice: crm_update_peer_state: cman_event_callback: Node kamet[2] - state is now member (was lost)
Jan 13 19:32:44 [4296] ora
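For reference, a colocation constraint of the kind described above might look like this in crm shell syntax. It is only a sketch, using C-FLT and foo-master from the status output; the constraint id and score are illustrative and the poster's actual constraint may differ:

# Keep the floating IP C-FLT on the node where foo-master is promoted
# to Master; analogous constraints would exist for the other IPs.
colocation c-flt-with-foo-master inf: C-FLT foo-master:Master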
[ClusterLabs] fence-agents 4.0.22 release
Welcome to the fence-agents 4.0.22 release.

This release includes several bug fixes and features:

* New fence agents for VirtualBox and SBD
* A lot of changes in fence_compute (OpenStack)
* Re-enable fence_zvm
* Support for APC firmware v6.x
* Add a hard-reboot option for the fence_scsi_check script
* Add an option for setting the Docker Remote API version
* Fix the HP Brocade fence agent, where getting the status was broken
* Fix a regression in the IPMI fence agent (timeout settings, deprecated options)
* New action 'diag' for fence_ipmi

The git repository can be found at:
https://github.com/ClusterLabs/fence-agents/

The new source tarball can be downloaded here:
https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz

To report bugs or issues:
https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net, #linux-cluster) and share your experience with other system administrators and power users.

Thanks and congratulations to everyone who contributed to this milestone.

m,
Re: [ClusterLabs] Q: What is the meaning of "sbd: [19541]: info: Watchdog enabled."
On 01/13/2016 04:34 AM, Ulrich Windl wrote:
> Since an update of sbd in SLES11 SP4 (sbd-1.2.1-0.12.1), I see
> frequent syslog messages like these (grep "Watchdog enabled."
> /var/log/messages):

Hi,

This happened to me as well. It turned out I was running a monitor operation on my SBD resource; I removed it and everything went back to normal. I started working with Linux HA on SLES 11 SP4 (so I don't know how it behaved before that), but since you mention it started happening with SP4, it must be a bug.

I'm one of those people who don't want to see ANYTHING in the logs unless it is strictly necessary (no news is good news).

Going back to the original issue, I guess the proper question is: should we run a monitor operation on the SBD resource at all? I just did a test on one of my test VMs: I killed the parent SBD process (kill -9) and the VM was hard-reset right away. I'm not sure who initiated it (stonithd or pacemaker); it wasn't the watchdog, because the reset was too fast (my watchdog timeout is set to 15 seconds).

There you go...

Regards,
Jorge
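To illustrate the "no monitor operation" setup described above, a minimal SBD stonith primitive in crm shell syntax might look like the sketch below. The resource name is a placeholder and the external/sbd agent is assumed (typical on SLES); this is not taken from the poster's actual configuration:

# SBD stonith resource with no monitor operation configured.
primitive stonith-sbd stonith:external/sbd
# If an "op monitor ..." line is already present on such a resource,
# it can be removed with "crm configure edit stonith-sbd".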
[ClusterLabs] Q: What is the meaning of "sbd: [19541]: info: Watchdog enabled."
Hi!

Since an update of sbd in SLES11 SP4 (sbd-1.2.1-0.12.1), I see frequent syslog messages like these (grep "Watchdog enabled." /var/log/messages):

Jan 13 00:01:01 h02 sbd: [19373]: info: Watchdog enabled.
Jan 13 00:01:01 h02 sbd: [19380]: info: Watchdog enabled.
Jan 13 00:04:02 h02 sbd: [21740]: info: Watchdog enabled.
Jan 13 00:04:02 h02 sbd: [21747]: info: Watchdog enabled.
Jan 13 00:07:04 h02 sbd: [24073]: info: Watchdog enabled.
Jan 13 00:07:04 h02 sbd: [24080]: info: Watchdog enabled.
Jan 13 00:10:05 h02 sbd: [26381]: info: Watchdog enabled.
Jan 13 00:10:05 h02 sbd: [26388]: info: Watchdog enabled.
Jan 13 00:13:06 h02 sbd: [28748]: info: Watchdog enabled.
Jan 13 00:13:06 h02 sbd: [28755]: info: Watchdog enabled.
Jan 13 00:16:07 h02 sbd: [31066]: info: Watchdog enabled.
Jan 13 00:16:07 h02 sbd: [31073]: info: Watchdog enabled.
Jan 13 00:19:08 h02 sbd: [1000]: info: Watchdog enabled.
Jan 13 00:19:08 h02 sbd: [1008]: info: Watchdog enabled.
Jan 13 00:22:09 h02 sbd: [3377]: info: Watchdog enabled.
Jan 13 00:22:09 h02 sbd: [3388]: info: Watchdog enabled.
Jan 13 00:25:10 h02 sbd: [5777]: info: Watchdog enabled.
Jan 13 00:25:10 h02 sbd: [5784]: info: Watchdog enabled.
Jan 13 00:28:11 h02 sbd: [8157]: info: Watchdog enabled.
Jan 13 00:28:11 h02 sbd: [8166]: info: Watchdog enabled.
Jan 13 00:31:13 h02 sbd: [10453]: info: Watchdog enabled.
Jan 13 00:31:13 h02 sbd: [10460]: info: Watchdog enabled.
Jan 13 00:34:14 h02 sbd: [12909]: info: Watchdog enabled.
Jan 13 00:34:14 h02 sbd: [12916]: info: Watchdog enabled.
Jan 13 00:37:15 h02 sbd: [15244]: info: Watchdog enabled.
Jan 13 00:37:15 h02 sbd: [15251]: info: Watchdog enabled.
Jan 13 00:40:16 h02 sbd: [17661]: info: Watchdog enabled.
Jan 13 00:40:16 h02 sbd: [17669]: info: Watchdog enabled.
Jan 13 00:43:17 h02 sbd: [20020]: info: Watchdog enabled.
Jan 13 00:43:17 h02 sbd: [20027]: info: Watchdog enabled.
Jan 13 00:46:19 h02 sbd: [22332]: info: Watchdog enabled.
Jan 13 00:46:19 h02 sbd: [22338]: info: Watchdog enabled.
Jan 13 00:49:20 h02 sbd: [24679]: info: Watchdog enabled.
Jan 13 00:49:20 h02 sbd: [24686]: info: Watchdog enabled.
[...]

It seems a new sbd process is started each time (note the new PIDs), but why? Is it a bug, maybe?

Regards,
Ulrich
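One way to check whether this cadence corresponds to a monitor operation configured on an SBD stonith resource (as suggested in the reply above) is to look at the resource definition. This is only a suggestion and assumes a crmsh-managed cluster with the external/sbd agent in use; adjust the pattern if your setup differs:

# Show the SBD stonith resource definition, including any "op monitor"
# line and its interval.
crm configure show | grep -B 2 -A 4 'external/sbd'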