[ClusterLabs] nfsserver_monitor() doesn't detect nfsd process is lost.

2016-01-13 Thread yuta takeshita
Hello.

I have a problem with the nfsserver RA on RHEL 7.1 with systemd.
When the nfsd processes are lost due to an unexpected failure, nfsserver_monitor()
does not detect it and no failover is executed.

I use the RA below (but this problem may exist in the latest nfsserver RA
as well):
https://github.com/ClusterLabs/resource-agents/blob/v3.9.6/heartbeat/nfsserver

The cause is as follows:

1. After executing "pkill -9 nfsd", "systemctl status nfs-server.service"
still returns 0.
2. nfsserver_monitor() judges the resource state only by the return value of
"systemctl status nfs-server.service".

--
# ps ax | grep nfsd
25193 ?        S<     0:00 [nfsd4]
25194 ?        S<     0:00 [nfsd4_callbacks]
25197 ?        S      0:00 [nfsd]
25198 ?        S      0:00 [nfsd]
25199 ?        S      0:00 [nfsd]
25200 ?        S      0:00 [nfsd]
25201 ?        S      0:00 [nfsd]
25202 ?        S      0:00 [nfsd]
25203 ?        S      0:00 [nfsd]
25204 ?        S      0:00 [nfsd]
25238 pts/0    S+     0:00 grep --color=auto nfsd
#
# pkill -9 nfsd
#
# systemctl status nfs-server.service
● nfs-server.service - NFS server and services
   Loaded: loaded (/etc/systemd/system/nfs-server.service; disabled; vendor
preset: disabled)
   Active: active (exited) since Thu 2016-01-14 11:35:39 JST; 1min 3s ago
  Process: 25184 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS (code=exited,
status=0/SUCCESS)
  Process: 25182 ExecStartPre=/usr/sbin/exportfs -r (code=exited,
status=0/SUCCESS)
 Main PID: 25184 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/nfs-server.service
(snip)
#
# echo $?
0
#
# ps ax | grep nfsd
25256 pts/0    S+     0:00 grep --color=auto nfsd
--

This is because nfsd is a kernel process, and systemd does not monitor
the state of running kernel processes.

Is there a good way to handle this?
(When I use "pidof" instead of "systemctl status", the failover succeeds.)
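
For example, a minimal sketch of such an additional check inside
nfsserver_monitor() could look like the following (only an illustration, not
an actual upstream fix; ocf_log and $OCF_ERR_GENERIC come from the standard
OCF shell functions):

--
# in addition to checking "systemctl status nfs-server.service",
# verify that the nfsd kernel threads really exist
if ! pidof nfsd > /dev/null 2>&1; then
    ocf_log err "nfsd kernel threads are not running"
    return $OCF_ERR_GENERIC
fi
--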

Regards,
Yuta Takeshita
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Fwd: Parallel adding of resources

2016-01-13 Thread Arjun Pandey
Hi

I am running a 2-node cluster with this config on CentOS 6.6:

Master/Slave Set: foo-master [foo]
   Masters: [ kamet ]
   Slaves: [ orana ]
fence-uc-orana (stonith:fence_ilo4): Started kamet
fence-uc-kamet (stonith:fence_ilo4): Started orana
C-3 (ocf::pw:IPaddr): Started kamet
C-FLT (ocf::pw:IPaddr): Started kamet
C-FLT2 (ocf::pw:IPaddr): Started kamet
E-3 (ocf::pw:IPaddr): Started kamet
MGMT-FLT (ocf::pw:IPaddr): Started kamet
M-FLT (ocf::pw:IPaddr): Started kamet
M-FLT2 (ocf::pw:IPaddr): Started kamet
S-FLT (ocf::pw:IPaddr): Started kamet
S-FLT2 (ocf::pw:IPaddr): Started kamet


Here foo is a multi-state resource run in master/slave mode, and the
IPaddr RA is just a modified IPaddr2 RA. Additionally, I have a
colocation constraint for each IP address to be colocated with the
master (roughly as sketched right after this paragraph). I have also
configured fencing, and when I unplug the redundancy interface, fencing
is triggered correctly. However, once the fenced node (kamet) rejoins, I
see that all my floating IP resources have been deleted, and the cluster
ends up in the state shown further below. Also, if I log into kamet, I
see that the floating IP addresses are actually still present there.
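
The colocation constraints are of roughly this form (a hedged sketch using
pcs with one of the IP resources above as an example; the actual constraint
definitions were not shown, so names and exact syntax may differ):

--
# keep the floating IP on whichever node runs the master instance of foo
pcs constraint colocation add C-FLT with master foo-master INFINITY
--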

Master/Slave Set: foo-master [foo]
   Masters: [ orana ]
   Slaves: [ kamet ]
fence-uc-orana (stonith:fence_ilo4): Started orana
fence-uc-kamet (stonith:fence_ilo4): Started orana

CIB state post fencing of kamet:

(snip: CIB XML)

Attaching full corosync.log from orana.

The interesting parts of the log are quoted below.

Jan 13 19:32:44 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jan 13 19:32:44 corosync [QUORUM] Members[2]: 1 2
Jan 13 19:32:44 corosync [QUORUM] Members[2]: 1 2
Jan 13 19:32:44 [4296] orana   crmd: info:
cman_event_callback: Membership 7044: quorum retained
Jan 13 19:32:44 [4296] orana   crmd:   notice:
crm_update_peer_state: cman_event_callback: Node kamet[2] - state is
now member (was lost)
Jan 13 19:32:44 [4296] ora

[ClusterLabs] fence-agents 4.0.22 release

2016-01-13 Thread Marek marx Grác
Welcome to the fence-agents 4.0.22 release 

This release includes several bugfixes and features: 

* New fence agents for VirtualBox and SBD
* A lot of changes in fence_compute (OpenStack)
* Re-enable fence_zvm 

* Support for APC firmware v6.x
* Add hard-reboot option for fence_scsi_check script
* Add option for setting Docker Remote API version
* Fix HP Brocade fence agent where getting status was broken
* Fix regression in IPMI fence agent (timeout settings, deprecated options)
* New action ‘diag’ for fence_ipmi


Git repository can be found at https://github.com/ClusterLabs/fence-agents/ 

The new source tarball can be downloaded here: 

https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz 

To report bugs or issues: 

https://bugzilla.redhat.com/ 

Would you like to meet the cluster team or members of its community? 

Join us on IRC (irc.freenode.net #linux-cluster) and share your 
experience with other system administrators or power users. 

Thanks and congratulations to all the people who contributed to this 
great milestone. 

m, 



Re: [ClusterLabs] Q: What is the meaning of "sbd: [19541]: info: Watchdog enabled."

2016-01-13 Thread Jorge Fábregas
On 01/13/2016 04:34 AM, Ulrich Windl wrote:
> Since an update of sbd in SLES11 SP4 (sbd-1.2.1-0.12.1), I see
> frequent syslog messages like these (grep "Watchdog enabled."
> /var/log/messages):

Hi,

This happened to me as well.  It turned out I was running a monitor on
my SBD resource.  I removed it and everything went back to normal.
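
If you want to check for the same thing, something like the following should
show it (a hedged sketch with crmsh on SLES 11; the resource name
"stonith-sbd" is only an example, adjust it to your configuration):

--
# show how the SBD stonith resource is configured; look for an
# "op monitor interval=..." line
crm configure show stonith-sbd

# if a monitor op is there, edit the resource and remove that line
crm configure edit stonith-sbd
--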

I started working with Linux HA on SLES 11 SP4 (so I don't know how it
behaved before that), but since you mention it started happening with
SP4, then it must be a bug.

I'm one of those people who don't want to see ANYTHING in the logs unless it
is completely necessary (no news is good news).

Going back to the original issue, I guess the proper question is:
should we run a monitor operation on the SBD resource at all? I just did
a test on one of my test VMs: I killed the parent SBD process (kill -9)
and the VM restarted (hard reset) right away. I'm not sure which
component initiated it (stonithd or pacemaker); it wasn't the watchdog,
because the reset happened too fast (I have the watchdog set to 15
seconds). There you go...



Regards,
Jorge



[ClusterLabs] Q: What is the meaning of "sbd: [19541]: info: Watchdog enabled."

2016-01-13 Thread Ulrich Windl
Hi!

Since an update of sbd in SLES11 SP4 (sbd-1.2.1-0.12.1), I see frequent syslog 
messages like these (grep "Watchdog enabled." /var/log/messages):
Jan 13 00:01:01 h02 sbd: [19373]: info: Watchdog enabled.
Jan 13 00:01:01 h02 sbd: [19380]: info: Watchdog enabled.
Jan 13 00:04:02 h02 sbd: [21740]: info: Watchdog enabled.
Jan 13 00:04:02 h02 sbd: [21747]: info: Watchdog enabled.
Jan 13 00:07:04 h02 sbd: [24073]: info: Watchdog enabled.
Jan 13 00:07:04 h02 sbd: [24080]: info: Watchdog enabled.
Jan 13 00:10:05 h02 sbd: [26381]: info: Watchdog enabled.
Jan 13 00:10:05 h02 sbd: [26388]: info: Watchdog enabled.
Jan 13 00:13:06 h02 sbd: [28748]: info: Watchdog enabled.
Jan 13 00:13:06 h02 sbd: [28755]: info: Watchdog enabled.
Jan 13 00:16:07 h02 sbd: [31066]: info: Watchdog enabled.
Jan 13 00:16:07 h02 sbd: [31073]: info: Watchdog enabled.
Jan 13 00:19:08 h02 sbd: [1000]: info: Watchdog enabled.
Jan 13 00:19:08 h02 sbd: [1008]: info: Watchdog enabled.
Jan 13 00:22:09 h02 sbd: [3377]: info: Watchdog enabled.
Jan 13 00:22:09 h02 sbd: [3388]: info: Watchdog enabled.
Jan 13 00:25:10 h02 sbd: [5777]: info: Watchdog enabled.
Jan 13 00:25:10 h02 sbd: [5784]: info: Watchdog enabled.
Jan 13 00:28:11 h02 sbd: [8157]: info: Watchdog enabled.
Jan 13 00:28:11 h02 sbd: [8166]: info: Watchdog enabled.
Jan 13 00:31:13 h02 sbd: [10453]: info: Watchdog enabled.
Jan 13 00:31:13 h02 sbd: [10460]: info: Watchdog enabled.
Jan 13 00:34:14 h02 sbd: [12909]: info: Watchdog enabled.
Jan 13 00:34:14 h02 sbd: [12916]: info: Watchdog enabled.
Jan 13 00:37:15 h02 sbd: [15244]: info: Watchdog enabled.
Jan 13 00:37:15 h02 sbd: [15251]: info: Watchdog enabled.
Jan 13 00:40:16 h02 sbd: [17661]: info: Watchdog enabled.
Jan 13 00:40:16 h02 sbd: [17669]: info: Watchdog enabled.
Jan 13 00:43:17 h02 sbd: [20020]: info: Watchdog enabled.
Jan 13 00:43:17 h02 sbd: [20027]: info: Watchdog enabled.
Jan 13 00:46:19 h02 sbd: [22332]: info: Watchdog enabled.
Jan 13 00:46:19 h02 sbd: [22338]: info: Watchdog enabled.
Jan 13 00:49:20 h02 sbd: [24679]: info: Watchdog enabled.
Jan 13 00:49:20 h02 sbd: [24686]: info: Watchdog enabled.
[...]

It seems some sbd process is started again and again (a new PID roughly
every three minutes), but why? Is it a bug, maybe?
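
Could it be a recurring monitor operation? The roughly three-minute cadence
would match one. A hedged way to check with crmsh (the agent name
external/sbd is an assumption for my setup):

--
# show the SBD stonith resource together with its operations
crm configure show | grep -B1 -A3 'external/sbd'
--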

Regards,
Ulrich
