Re: [Pacemaker] Colocation set options (pcs syntax)
> pcs constraint colocation set fs_ldap-clone sftp01-vip ldap1 sequential=true
>
> Let me know if this does or doesn't work for you.

I have been testing this for a couple of days now and I think I must be doing something wrong. First though, the command itself completes successfully:

# pcs constraint show --full
Resource Sets:
  set fs_ldap-clone sftp01-vip ldap1 sequential=true (id:pcs_rsc_set) (id:pcs_rsc_colocation)

However, if I test it by moving, for example, the "sftp01-vip" resource group to another node, it does not move the ldap1 service with it. Example below.

Cluster state before the resource move: http://pastebin.com/a13ZhyRq

Then I run "pcs resource move sftp01-vip bfievsftp02", which moves the resources to that node (except the associated ldap1 service).

Cluster state after the move: http://pastebin.com/BSyTBEhX

Full constraint list: http://pastebin.com/ng6m4C1Z

Here is what I am trying to achieve:

[1] The sftp0[1-3]-vip groups each have a preferred node (sftp01-vip=node1, sftp02-vip=node2, sftp03-vip=node3)
[2] The sftp0[1-3] lsb resources are colocated with the sftp0[1-3]-vip groups
[3] The ldap[1-3] lsb resources are colocated with the sftp0[1-3]-vip groups

I managed to achieve the above using location and colocation constraints, as listed in the constraint output. However, the sftp0[1-3] and ldap[1-3] lsb resources also depend on fs_cdr-clone and fs_ldap-clone, respectively, being available. I thought I would be able to achieve that file-system dependency using the colocation set, but it does not seem to work the way I am expecting it to - or, quite possibly, my logic may be slightly (largely) off :)

How would I ensure that, in the case of a node failure, the vip group moves to a node which has the fs_cdr and fs_ldap file system resources available? If I can do that, I can keep the colocation rule tying the sftp/ldap service to the vip group. Or am I thinking about this the wrong way around?

Any tips/suggestions would be appreciated.

Thanks
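One way to express that dependency without a resource set is plain pairwise constraints. The sketch below reuses the resource names from this post, but the scores and exact layout are an assumption rather than a tested configuration:

# Hypothetical sketch: allow the vip group only on nodes where the filesystem
# clones are active, then hang the ldap service off the vip group
pcs constraint colocation add sftp01-vip with fs_cdr-clone INFINITY
pcs constraint colocation add sftp01-vip with fs_ldap-clone INFINITY
pcs constraint order start fs_ldap-clone then start sftp01-vip
pcs constraint colocation add ldap1 with sftp01-vip INFINITY
pcs constraint order start sftp01-vip then start ldap1

The same pattern would be repeated for the sftp02/sftp03 services and their vip groups.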
Re: [Pacemaker] Colocation set options (pcs syntax)
>
> I think there's an error in the man page (which I'll work on getting fixed).
>

Thanks Chris.

> Can you try: (removing 'setoptions' from your command)
>
> pcs constraint colocation set fs_ldap-clone sftp01-vip ldap1 sequential=true
>
> Let me know if this does or doesn't work for you.
>

I shall give this a go a little later today and get back to you.
[Pacemaker] Colocation set options (pcs syntax)
Hi All,

I have several resources that depend on a cloned shared file system and a vip which need to be up and operational before the resource can start. I was reading the pacemaker documentation and it looks like colocation sets are what I am after. I can see in the documentation that you can define a colocation set and set the sequential option to "true" if you need the resources to start sequentially; I guess this then becomes an ordered colocation set, which is what I am after. The documentation I was reading is here:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-sets-collocation.html

According to the pcs man page I can set up a colocation set as follows:

colocation set <resource1> <resource2> [resourceN]... [setoptions <name>=<value>] ... [set <resourceX> <resourceY> ...] [setoptions <name>=<value>...]

However, when I run the following command to create the set:

pcs constraint colocation set fs_ldap-clone sftp01-vip ldap1 setoptions sequential=true

I get an error stating:

Error: Unable to update cib
Call cib_replace failed (-203): Update does not conform to the configured schema

and then a dump of the currently running configuration (CIB). Am I reading the man page incorrectly, or is this a bug I need to report?

Thanks
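For what it's worth, what a working set command produces in the CIB is a resource set nested inside a colocation constraint, roughly like the sketch below (based on the resource-set documentation; the score and the ids shown are illustrative assumptions):

<rsc_colocation id="pcs_rsc_colocation" score="INFINITY">
  <resource_set id="pcs_rsc_set" sequential="true">
    <resource_ref id="fs_ldap-clone"/>
    <resource_ref id="sftp01-vip"/>
    <resource_ref id="ldap1"/>
  </resource_set>
</rsc_colocation>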
Re: [Pacemaker] node1 fencing itself after node2 being fenced
Just an update on this issue, which has now been resolved. The issue was with my cluster configuration: dlm + sctp do not play nicely with each other. I had to un-configure the redundant rings and set rrp_mode to "none", after which clvmd works as expected. Thanks to all for your assistance with this issue.
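For anyone hitting the same dlm-over-sctp problem: the fix described above amounts to taking the altname (redundant ring) entries back out of cluster.conf and forcing single-ring operation, so that dlm can fall back to TCP. In the ccs style used elsewhere in this thread that looks roughly like the following sketch (not the exact commands that were run):

# switch totem back to single-ring operation
ccs -f /etc/cluster/cluster.conf --settotem rrp_mode="none"
# the altname entries added earlier with "ccs --addalt" also need to be
# removed from cluster.conf (edit the file or regenerate it without them)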
Re: [Pacemaker] node1 fencing itself after node2 being fenced
>
> I would really love to see logs at this point.
> Both from pacemaker and the system in general (and clvmd if it produces any).
>
> Based on what you say below, there doesn't seem to be a good reason for
> the hang (ie. no reason to be trying to fence anyone)
>

I will try to get some logs to you and Fabio today; I just want to enable debug logging for the cluster as Fabio suggested, and will re-enable debug logging for clvmd (as suggested by Vladislav earlier in the thread).

>
> Right. I forgot. Sorry. Carry on :-)
> There have been a bunch of discussions going on regarding clvmd in rhel7
> and they got muddled in my head.
>

No worries sir, it happens to the best of us :)
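As a rough sketch of the debug logging being talked about here (my assumption is that the cluster-wide part goes into cluster.conf on a cman-based cluster, and the clvmd flag is the one mentioned elsewhere in the thread):

<!-- /etc/cluster/cluster.conf: turn on debug logging for the cman/corosync side -->
<logging debug="on"/>

# ask the already-running clvmd instances to start debug logging to syslog
clvmd -C -d2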
Re: [Pacemaker] node1 fencing itself after node2 being fenced
>
> Just a guess. Do you have startup fencing enabled in dlm-controld (I actually
> do not remember if it is applicable to cman's version, but it exists in dlm-4)
> or cman?
> If yes, then that may play its evil game, because imho it is not intended to be
> used with pacemaker, which has its own startup fencing policy (if you redirect
> fencing to pacemaker).
>

I can't seem to find the option to enable/disable startup fencing in either dlm_controld or cman. "dlm_controld -h" doesn't list an option to enable/disable startup fencing, and from a quick read of the cman man page I also don't see any option mentioning startup fencing. Would you mind pointing me in the direction of the parameter to disable this in cman/dlm_controld, please?

PS: I am redirecting all fencing operations to pacemaker using the following directive:

Thanks
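The directive in question is the standard fence_pcmk redirection in cluster.conf; judging from the ccs commands shown elsewhere in this thread, it looks roughly like this sketch (node test01 shown):

<clusternode name="test01" nodeid="1">
  <fence>
    <method name="pcmk-redirect">
      <device name="pcmk" port="test01"/>
    </method>
  </fence>
</clusternode>

<fencedevices>
  <fencedevice agent="fence_pcmk" name="pcmk"/>
</fencedevices>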
Re: [Pacemaker] node1 fencing itself after node2 being fenced
>
> i sometimes have the same situation. sleep ~30 seconds between startup
> cman and clvmd helps a lot.
>

Thanks for the tip. I just tried this (added a "sleep 30" to the start section of the case statement in the cman init script), but it did not resolve the issue for me. For some reason clvmd just refuses to start, and I don't see much debug output appearing, so I cannot say for sure what clvmd is trying to do :(
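In other words, the suggestion is simply to delay clvmd until cman/dlm have settled. At its crudest that amounts to something like the following (a hypothetical sketch, not the actual init-script edit that was made):

# hypothetical: start the stack by hand with a pause between cman and clvmd
service cman start
sleep 30
service clvmd start
service pacemaker start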
Re: [Pacemaker] node1 fencing itself after node2 being fenced
>
> The 3rd node should (and needs to be) fenced at this point to allow the
> cluster to continue.
> Is this not happening?

The fencing operation appears to complete successfully; here is the sequence:

[1] All 3 nodes running properly.
[2] On node 3 I run "echo c > /proc/sysrq-trigger", which "hangs" node 3.
[3] The fence_test03 resource executes a fence operation on node 3 (fires a shutdown/startup of the VM).
[4] dlm shows the kern_stop state while node 3 is being fenced.
[5] Node 3 reboots, and nodes 1 & 2 operate as normal (clvmd and gfs2 work properly, dlm is notified that the fence was successful, 2 members in each lock group).
[6] While node 3 is booting, cman starts properly, then clvmd starts but hangs during boot.
[7] While node 3 is "hung" at the clvmd stage, nodes 1 & 2 are unable to perform lvm operations because node 3 is attempting to join the clvmd "group". dlm shows node 3 as a member and cman sees node 3 as a cluster member, but pacemaker has not started because clvmd has not successfully started. Because pacemaker is not "up", and because I do not have a clvmd resource definition, no fence is performed if/when clvmd fails.

Other than the above, fencing appears to be working properly. Are there some other fencing tests you would like me to perform to verify that fencing is working as expected?

>
> Did you specify on-fail=fence for the clvmd agent?
>

Hmm, I don't have any clvmd agents defined within pacemaker at the moment, as I am starting clvmd outside of pacemaker control. In my original post I had clvmd and dlm defined as clone resources under pacemaker control. My understanding from the responses to that post was to remove those resources from pacemaker control, run clvmd at boot, and let dlm be managed by the cman startup. Are you saying that I should have dlm/clvmd defined as pacemaker resources and still have clvmd start on boot-up?

For example, originally I defined dlm/clvmd under pacemaker control as follows:

pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs resource create clvmd lsb:clvmd op monitor interval=30s on-fail=fence clone interleave=true ordered=true

However, right now, the above two resource definitions have been removed from pacemaker.

Thanks for your time (and that of others) in assisting me with this issue thus far.

Thanks
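For reference, if dlm/clvmd did go back under pacemaker control, the clone definitions above are normally paired with ordering and colocation constraints, as in the config posted elsewhere in this thread:

pcs constraint order start dlm-clone then start clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone INFINITY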
Re: [Pacemaker] node1 fencing itself after node2 being fenced
> -----Original Message-----
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: 17 February 2014 00:55
> To: li...@blueface.com; The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced
>
> If you have configured cman to use fence_pcmk, then all cman/dlm/clvmd
> fencing operations are sent to Pacemaker.
> If you aren't running pacemaker, then you have a big problem as no-one can
> perform fencing.

I have configured pacemaker as the resource manager and I have it enabled to start on boot-up too, as follows:

chkconfig cman on
chkconfig clvmd on
chkconfig pacemaker on

>
> I don't know if you are testing without pacemaker running, but if so you
> would need to configure cman with real fencing devices.
>

I have been testing with pacemaker running, and the fencing appears to be operating fine. The issue I seem to have is that clvmd is unable to re-acquire its locks when attempting to rejoin the cluster after a fence operation, so it looks like clvmd just hangs when the startup script fires it off on boot-up. When the 3rd node is in this state (hung clvmd), the other 2 nodes are unable to obtain locks from the third node. As an example, this is what happens when the 3rd node is hung at the clvmd startup phase after pacemaker has issued a fence operation (running pvs on node 1):

[root@test01 ~]# pvs
  Error locking on node test03: Command timed out
  Unable to obtain global lock.

The dlm elements look fine to me here too:

[root@test01 ~]# dlm_tool ls
dlm lockspaces
name          cdr
id            0xa8054052
flags         0x0008 fs_reg
change        member 2 joined 0 remove 1 failed 1 seq 2,2
members       1 2

name          clvmd
id            0x4104eefa
flags         0x0000
change        member 3 joined 1 remove 0 failed 0 seq 3,3
members       1 2 3

So it looks like cman/dlm are operating properly; however, clvmd hangs and never exits, so pacemaker never starts on the 3rd node. The 3rd node therefore sits in the "pending" state while clvmd is hung:

[root@test02 ~]# crm_mon -Afr -1
Last updated: Mon Feb 17 15:52:28 2014
Last change: Mon Feb 17 15:43:16 2014 via cibadmin on test01
Stack: cman
Current DC: test02 - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
3 Nodes configured
15 Resources configured

Node test03: pending
Online: [ test01 test02 ]

Full list of resources:

fence_test01   (stonith:fence_vmware_soap):   Started test01
fence_test02   (stonith:fence_vmware_soap):   Started test02
fence_test03   (stonith:fence_vmware_soap):   Started test01
Clone Set: fs_cdr-clone [fs_cdr]
    Started: [ test01 test02 ]
    Stopped: [ test03 ]
Resource Group: sftp01-vip
    vip-001   (ocf::heartbeat:IPaddr2):   Started test01
    vip-002   (ocf::heartbeat:IPaddr2):   Started test01
Resource Group: sftp02-vip
    vip-003   (ocf::heartbeat:IPaddr2):   Started test02
    vip-004   (ocf::heartbeat:IPaddr2):   Started test02
Resource Group: sftp03-vip
    vip-005   (ocf::heartbeat:IPaddr2):   Started test02
    vip-006   (ocf::heartbeat:IPaddr2):   Started test02
sftp01   (lsb:sftp01):   Started test01
sftp02   (lsb:sftp02):   Started test02
sftp03   (lsb:sftp03):   Started test02

Node Attributes:
* Node test01:
* Node test02:
* Node test03:

Migration summary:
* Node test03:
* Node test02:
* Node test01:
Re: [Pacemaker] node1 fencing itself after node2 being fenced
> -----Original Message-----
> From: Vladislav Bogdanov [mailto:bub...@hoster-ok.com]
> Sent: 11 February 2014 03:44
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced
>
> Nope, it's Centos6. In a few words, it is probably safer for you to stay with
> cman, especially if you need GFS2. gfs_controld is not officially ported to
> corosync2 and is obsolete in EL7, because communication between
> gfs2 and dlm is moved to kernelspace there.
>

OK, thanks. I may do some searching on how to compile corosync2 on CentOS 6 for a different cluster I need to set up that does not have the gfs2 requirement; thanks for the info.

>
> You need to fix that for sure.
>

I ended up rebuilding all my nodes and adding a third one to see if quorum might have been the issue, but the symptoms are still the same. I ended up stracing clvmd, and it looks like it tries to write to /dev/misc/dlm_clvmd, which doesn't exist on the "failed" node. I attached the trace to an existing bug listed in the CentOS bug tracker:

http://bugs.centos.org/view.php?id=6853

This looks like something to do with clvmd and its locks, but dlm appears to be operating fine for me; I don't see any kern_stop flags for clvmd at all when the node is being fenced. It is a strange one, because if I shut down and reboot any of the nodes cleanly then everything comes back up OK; however, when I simulate a failure, this is where the issue comes in.

>
> Strange message, looks like something is bound to that port already.
> You may want to try dlm in tcp mode btw.
>

I was unable to run dlm in tcp mode as I have dual-homed interfaces, so dlm won't run in tcp mode in this case :) Thanks for the recommendation though.
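For completeness, the tcp-versus-sctp choice Vladislav mentions is driven from cluster.conf. With a single ring, something along these lines should select it, though (as noted above) it is not usable with two rings, since redundant-ring operation needs SCTP; treat this as a sketch rather than a verified setting:

<!-- /etc/cluster/cluster.conf: force the dlm lowcomms protocol -->
<dlm protocol="tcp"/>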
Re: [Pacemaker] node1 fencing itself after node2 being fenced
> -----Original Message-----
> From: Vladislav Bogdanov [mailto:bub...@hoster-ok.com]
> Sent: 10 February 2014 13:27
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced
>
> I cannot really recall if it hangs or returns an error for that (I moved to
> corosync2 long ago).
>

Are you running corosync2 on RHEL7 beta? Are we able to run corosync2 on CentOS 6/RHEL 6?

> Anyways you probably want to run clvmd with debugging enabled. iirc you have
> two choices here: either you'd need to stop the running instance first and
> then run it in the console with -f -d1, or run clvmd -C -d2 to ask all running
> instances to start debug logging to syslog. I prefer the first one, because
> modern syslogs do rate-limiting. And, you'd need to run the lvm commands with
> debugging enabled too.
>

Thanks for this tip. I have modified clvmd to run in debug mode ("clvmd -T60 -d 2 -I cman") and I notice that on the node 2 reboot I don't see any logs of clvmd actually attempting to start, so it appears there is something wrong here with clvmd. However, I did try to manually stop/start clvmd on node 2 after a reboot, and these were the errors logged:

Feb 10 12:37:08 test02 kernel: dlm: connecting to 1 sctp association 2
Feb 10 12:38:00 test02 kernel: dlm: Using SCTP for communications
Feb 10 12:38:00 test02 clvmd[2118]: Unable to create DLM lockspace for CLVM: Address already in use
Feb 10 12:38:00 test02 kernel: dlm: Can't bind to port 21064 addr number 1
Feb 10 12:38:00 test02 kernel: dlm: cannot start dlm lowcomms -98
Feb 10 12:39:37 test02 kernel: dlm: Using SCTP for communications
Feb 10 12:39:37 test02 clvmd[2137]: Unable to create DLM lockspace for CLVM: Address already in use
Feb 10 12:39:37 test02 kernel: dlm: Can't bind to port 21064 addr number 1
Feb 10 12:39:37 test02 kernel: dlm: cannot start dlm lowcomms -98
Feb 10 12:47:21 test02 clvmd[2159]: Unable to create DLM lockspace for CLVM: Address already in use
Feb 10 12:47:21 test02 kernel: dlm: Using SCTP for communications
Feb 10 12:47:21 test02 kernel: dlm: Can't bind to port 21064 addr number 1
Feb 10 12:47:21 test02 kernel: dlm: cannot start dlm lowcomms -98
Feb 10 12:48:14 test02 kernel: dlm: closing connection to node 2
Feb 10 12:48:14 test02 kernel: dlm: closing connection to node 1

So it appears that the issue is with clvmd attempting to communicate with, I presume, dlm. I tried to do some searching on this error and it appears there was a bug report, if I recall correctly around 2004, which was fixed, so I cannot see why this error is cropping up.

Some other strangeness: if I reboot the node a couple of times, it may start up properly on the second node, and then things appear to work properly. However, while node 2 is "down", clvmd on node 1 is still in a "hung" state even though dlm appears to think everything is good. Have you come across this issue before?

Thanks for your assistance thus far, I appreciate it.
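A rough way to see what is still holding the DLM port when that "Address already in use" error shows up is to look at the SCTP endpoints directly; a diagnostic sketch, assuming the sctp module exposes the usual /proc files:

# list SCTP endpoints and associations; look for the dlm port 21064
cat /proc/net/sctp/eps
cat /proc/net/sctp/assocs
# check whether a previous dlm lowcomms instance still has lockspaces up
dlm_tool ls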
Re: [Pacemaker] node1 fencing itself after node2 being fenced
Hi All,

OK, here is my testing using cman/clvmd enabled on system startup and clvmd outside of pacemaker control. I still seem to be getting the clvmd hang/fail situation even when running outside of pacemaker control. I cannot see off-hand where the issue is occurring, but maybe it is related to what Vladislav was saying, where clvmd hangs if it is not running on a cluster node that has cman running; however, I have both cman and clvmd enabled to start at boot. Here is a little synopsis of what appears to be happening:

[1] Everything is fine here, both nodes up and running:

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    444   2014-02-07 10:25:00  test01
   2   M    440   2014-02-07 10:25:00  test02

# dlm_tool ls
dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x0000
change        member 2 joined 1 remove 0 failed 0 seq 1,1
members       1 2

[2] Here I "echo c > /proc/sysrq-trigger" on node 2 (test02). I can see crm_mon saying that node 2 is in an unclean state, and fencing kicks in (reboots node 2):

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    440   2014-02-07 10:27:58  test01
   2   X    444                        test02

# dlm_tool ls
dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x0004 kern_stop
change        member 2 joined 1 remove 0 failed 0 seq 2,2
members       1 2
new change    member 1 joined 0 remove 1 failed 1 seq 3,3
new status    wait_messages 0 wait_condition 1 fencing
new members   1

[3] So the above looks fine so far, to my untrained eye: dlm is in the kern_stop state while waiting on a successful fence, then the node reboots and we have the following state:

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    440   2014-02-07 10:27:58  test01
   2   M    456   2014-02-07 10:35:42  test02

# dlm_tool ls
dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x0000
change        member 2 joined 1 remove 0 failed 0 seq 4,4
members       1 2

So it looks like dlm and cman seem to be working properly (again, I could be wrong, my untrained eye and all :) ). However, if I try to run any lvm/clvm status commands, they still just hang. Could this be related to clvmd doing a check when cman is up and running but clvmd has not started yet (as I understand from Vladislav's previous email)? Or do I have something fundamentally wrong with my fencing configuration?

Here is a link to the "dlm_tool dump" taken at the time of the above "dlm_tool ls" (if it helps): http://pastebin.com/KV6YZWrN

Again, thanks for all the info thus far.

Thanks
Re: [Pacemaker] node1 fencing itself after node2 being fenced
On 06/02/2014 04:30, Nikita Staroverov wrote:

> Why do you need clvmd as a cluster resource? If you start clvmd outside of
> a cluster your problem will be no problem at all.

I was running it under pacemaker because it is a neat way of seeing dependent services. When I remove dlm/clvmd from pacemaker control, I cannot immediately see that the shared file system has a dependency on clvmd and dlm. I guess this is just a personal preference.

However, I was testing dlm/clvmd outside of pacemaker control yesterday and my issue still persists, so I am wondering if there is something else amiss that I have not uncovered yet. I'm busy gathering logs, so I will get back to this a little later today.
Re: [Pacemaker] node1 fencing itself after node2 being fenced
On 06/02/2014 05:52, Vladislav Bogdanov wrote:

> Hi,
>
> I bet your problem comes from the LSB clvmd init script. Here is what it does:
>
> ===
> ...
> clustered_vgs() {
>     ${lvm_vgdisplay} 2>/dev/null | \
>         awk 'BEGIN {RS="VG Name"} {if (/Clustered/) print $1;}'
> }
>
> clustered_active_lvs() {
>     for i in $(clustered_vgs); do
>         ${lvm_lvdisplay} $i 2>/dev/null | \
>             awk 'BEGIN {RS="LV Name"} {if (/[^N^O^T] available/) print $1;}'
>     done
> }
>
> rh_status() {
>     status $DAEMON
> }
> ...
> case "$1" in
> ...
>   status)
>     rh_status
>     rtrn=$?
>     if [ $rtrn = 0 ]; then
>         cvgs="$(clustered_vgs)"
>         echo Clustered Volume Groups: ${cvgs:-"(none)"}
>         clvs="$(clustered_active_lvs)"
>         echo Active clustered Logical Volumes: ${clvs:-"(none)"}
>     fi
> ...
> esac
>
> exit $rtrn
> ===
>
> So, it not only looks at the status of the daemon itself, but also tries to
> list volume groups. And this operation is blocked because fencing is still in
> progress, and the whole cLVM thing (as well as DLM itself and all other
> dependent services) is frozen. So your resource times out in the monitor
> operation, and then pacemaker asks it to stop (unless you have on-fail=fence).
> Anyway, there is a big chance that the stop will fail too, and that leads
> again to fencing.
>
> cLVM is very fragile in my opinion (although newer versions running on the
> corosync2 stack seem to be much better). And it probably still doesn't work
> well when managed by pacemaker in CMAN-based clusters, because it blocks
> globally if any node in the whole cluster is online at the cman layer but
> doesn't run clvmd (I checked last time with .99). And that was the same for
> all stacks, until it was fixed for the corosync (only 2?) stack recently. The
> problem with that is that you cannot just stop pacemaker on one node (e.g. for
> maintenance); you should immediately stop cman as well (or run clvmd in the
> cman'ish way), or cLVM freezes on another node. This should be easily fixable
> in the clvmd code, but nobody cares.

Thanks for the explanation. This is interesting for me, as I need a volume manager in the cluster to manage the shared file systems in case I need to resize for some reason.

I think I may be coming up against something similar now that I am testing cman/clvmd outside of cluster resource management: even though I have cman/clvmd enabled outside pacemaker, the clvmd daemon still hangs when the 2nd node has been rebooted due to a fence operation. When it (node 2) reboots, cman and clvmd start, and I can see both nodes as members using cman_tool, but clvmd still seems to have an issue - it just hangs. I can't see off-hand whether dlm still thinks pacemaker is in the fence operation (or whether it has already returned true for a successful fence). I am still gathering logs and will post back to this thread once I have all my logs from yesterday and this morning.

I don't suppose anyone is aware of another volume manager available that is cluster-aware?

> Increasing the timeout for the LSB clvmd resource probably won't help you,
> because blocked (because DLM waits for fencing) LVM operations iirc never
> finish. You may want to search for the clvmd OCF resource-agent; it is
> available for SUSE I think. Although it is not perfect, it should work much
> better for you.

I will have a look around for this clvmd OCF agent and see what is involved in getting it to work on CentOS 6.5, if I don't have any success with the current recommendation of running it outside of pacemaker control.
Re: [Pacemaker] node1 fencing itself after node2 being fenced
On 05/02/2014 18:57, Nikita Staroverov wrote:

> It seems to me clvmd can't answer pacemaker's monitor operation in 30 sec
> because it is also locked by dlm. You don't need clvmd and dlm resources on
> cman-based clusters. clvmd can simply start after cman; both dlm and fenced
> are configured by cman.

Thanks for the tip. I was testing cman/clvmd outside of cluster resource management yesterday and have come across another issue. I will reply to this thread once I have gathered up all the logs.
Re: [Pacemaker] node1 fencing itself after node2 being fenced
On 05/02/2014 16:12, Digimer wrote:

> You say it's working now? If so, excellent. If you have any troubles though,
> please share your cluster.conf and 'pcs config show'.

Hi Digimer, no, it's not working as I expect it to when I test a crash of node 2: clvmd goes into a failed state and then node 1 gets "shot in the head". Other than that, the config appears to work fine with the minimal testing I have done so far :) I have attached the cluster.conf and pcs config output to the email (with minimal obfuscation).

Thanks

[root@test01 ~]# pcs config show
Cluster Name: sftp-cluster
Corosync Nodes:

Pacemaker Nodes:
 test01 test02

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor on-fail=fence interval=30s (dlm-monitor-interval-30s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=lsb type=clvmd)
   Operations: monitor on-fail=fence interval=30s (clvmd-monitor-interval-30s)
 Clone: fs-cdr-clone
  Meta Attrs: interleave=true
  Resource: fs-cdr (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/appvg/cdrlv directory=/shared/cdr fstype=gfs2 options=defaults,noatime,nodiratime
   Operations: monitor on-fail=fence interval=10s (fs-cdr-monitor-interval-10s)
 Group: sftp01-vip
  Resource: vip-001 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.6.0.16 cidr_netmask=24 nic=eth0
   Operations: monitor interval=5s (vip-001-monitor-interval-5s)
  Resource: vip-002 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.7.0.16 cidr_netmask=24 nic=eth1
   Operations: monitor interval=5s (vip-002-monitor-interval-5s)
 Group: sftp02-vip
  Resource: vip-003 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.6.0.17 cidr_netmask=24 nic=eth0
   Operations: monitor interval=5s (vip-003-monitor-interval-5s)
  Resource: vip-004 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.7.0.17 cidr_netmask=24 nic=eth1
   Operations: monitor interval=5s (vip-004-monitor-interval-5s)
 Resource: sftp01 (class=lsb type=sftp01)
  Operations: monitor interval=30s (sftp01-monitor-interval-30s)
 Resource: sftp02 (class=lsb type=sftp02)
  Operations: monitor interval=30s (sftp02-monitor-interval-30s)

Stonith Devices:
 Resource: fence_test01 (class=stonith type=fence_vmware_soap)
  Attributes: login=user passwd=password action=reboot ipaddr=vcenter_host port=TEST01 ssl=1 pcmk_host_list=test01
  Operations: monitor interval=60s (fence_test01-monitor-interval-60s)
 Resource: fence_test02 (class=stonith type=fence_vmware_soap)
  Attributes: login=user passwd=password action=reboot ipaddr=vcenter_host port=TEST02 ssl=1 pcmk_host_list=test02
  Operations: monitor interval=60s (fence_test02-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: sftp01
    Enabled on: test01 (score:INFINITY) (role: Started) (id:cli-prefer-sftp01)
  Resource: sftp01-vip
    Enabled on: test01 (score:100) (id:location-sftp01-vip-test01-100)
  Resource: sftp02
    Enabled on: test02 (score:INFINITY) (role: Started) (id:cli-prefer-sftp02)
  Resource: sftp02-vip
    Enabled on: test02 (score:100) (id:location-sftp02-vip-test02-100)
Ordering Constraints:
  start dlm-clone then start clvmd-clone (Mandatory) (id:order-dlm-clone-clvmd-clone-mandatory)
  start clvmd-clone then start fs-cdr-clone (Mandatory) (id:order-clvmd-clone-fs-cdr-clone-mandatory)
  start sftp01-vip then start sftp01 (Mandatory) (id:order-sftp01-vip-sftp01-mandatory)
  start sftp02-vip then start sftp02 (Mandatory) (id:order-sftp02-vip-sftp02-mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (INFINITY) (id:colocation-clvmd-clone-dlm-clone-INFINITY)
  fs-cdr-clone with clvmd-clone (INFINITY) (id:colocation-fs-cdr-clone-clvmd-clone-INFINITY)
  sftp01 with sftp01-vip (INFINITY) (id:colocation-sftp01-sftp01-vip-INFINITY)
  sftp02 with sftp02-vip (INFINITY) (id:colocation-sftp02-sftp02-vip-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6_5.2-368c726
 last-lrm-refresh: 1391176104
 no-quorum-policy: ignore
 stonith-enabled: true
Re: [Pacemaker] node1 fencing itself after node2 being fenced
On 05/02/2014 13:44, Nikita Staroverov wrote:

> Your setup is completely wrong, sorry. You must use the RHEL6 documentation,
> not RHEL7. In short, you should create a cman cluster according to the RHEL6
> docs, but use pacemaker instead of rgmanager and fence_pcmk as the fence agent
> for cman.

Thanks for the info; however, I am already using cman for cluster management and pacemaker as the resource manager. This is how I created the cluster, and it appears to be working OK; please let me know if this is not the correct method for CentOS/RHEL 6.5:

---
ccs -f /etc/cluster/cluster.conf --createcluster sftp-cluster
ccs -f /etc/cluster/cluster.conf --addnode test01
ccs -f /etc/cluster/cluster.conf --addalt test01 test01-alt
ccs -f /etc/cluster/cluster.conf --addnode test02
ccs -f /etc/cluster/cluster.conf --addalt test02 test02-alt
ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect test01
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect test02
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk test01 pcmk-redirect port=test01
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk test02 pcmk-redirect port=test02
ccs -f /etc/cluster/cluster.conf --setcman keyfile="/etc/corosync/authkey" transport="udpu" port="5405"
ccs -f /etc/cluster/cluster.conf --settotem rrp_mode="active"
sed -i.bak "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" /etc/sysconfig/cman

pcs stonith create fence_test01 fence_vmware_soap login="user" passwd="password" action="reboot" ipaddr="vcenter_host" port="TEST01" ssl="1" pcmk_host_list="test01" delay="15"
pcs stonith create fence_test02 fence_vmware_soap login="user" passwd="password" action="reboot" ipaddr="vcenter_host" port="TEST02" ssl="1" pcmk_host_list="test02"

pcs property set no-quorum-policy="ignore"
pcs property set stonith-enabled="true"
---

The above is taken directly from the pacemaker RHEL 6 two-node cluster quick start guide (except for the fence agent definitions). At this point the cluster comes up, cman_tool sees the two hosts as joined, and the cluster is communicating over the two rings defined. I couldn't find the equivalent "pcs" syntax to perform the above configuration; looking at the pcs man page, I couldn't track down how to, for example, set the security key file using pcs syntax.

The DLM/CLVMD/GFS2 configuration was taken from the RHEL7 documentation, as it illustrated how to set it up using pcs syntax. The configuration commands appear to work fine and the services appear to be configured correctly: pacemaker starts the services properly, and the cluster works properly if I enable/disable the services using pcs syntax, if I manually stop/start the pacemaker service, or if I perform a clean shutdown/restart of the second node. The issue comes in when I test a crash of the second node, which is where I hit the particular issue with fencing.

Reading some archives of this mailing list, there seem to be suggestions that dlm may be waiting on pacemaker to fence a node, which then causes a temporary "freeze" of the clvmd/gfs2 configuration; I understand this is by design. However, when I hang the 2nd node by doing "echo c > /proc/sysrq-trigger", I can see that stonithd begins fencing procedures for node 2. At this point, according to crm_mon, the dlm service is stopped on node 2 and started on node 1, and clvmd then goes into a failed state - I presume because of a possible timeout (I could be wrong), or potentially because it cannot communicate with clvmd on node 2.

When clvmd goes into a failed state, this is when stonithd attempts to fence node 1, and it does so successfully by shutting it down.

Some archive messages seem to suggest that clvmd should be started outside of the cluster at system boot (cman -> clvmd -> pacemaker); however, my personal preference would be to have these services managed by the cluster infrastructure, which is why I am attempting to set it up in this manner. Is there anyone else out there who may be running a similar configuration, with dlm/clvmd/[gfs/gfs2/ocfs] under pacemaker control?

Again, thanks for the info. I will do some more reading to ensure that I am using the correct syntax for pcs to configure these services.

Thanks
[Pacemaker] node1 fencing itself after node2 being fenced
Hi All,

First of all, thanks for the brilliant documentation at clusterlabs and the alteeve.ca tutorials! They helped me out a lot. I am relatively new to pacemaker, but come from a Solaris background with cluster experience and am now trying to get on board with pacemaker.

I have set up a 2-node cluster with a shared lun using pacemaker, cman, dlm, clvmd and gfs2. I have configured 2 stonith devices, one to fence each node. The issue I have is that when I test an unclean shutdown of the 2nd node, pacemaker goes ahead and fences the second node, but clvmd then goes into a failed state on node 1, and node 1 then fences itself (shuts down node 1). I suspect it has something to do with me setting on-fail=fence for the dlm/clvmd services/RAs. DLM appears to be fine, but clvmd is the one that goes into a failed state.

I suspect I have an issue with timeouts here, but, being new to pacemaker, I cannot see where; I am hoping a fresh pair of eyes can spot where I am going wrong.

I am running CentOS 6.5 in VMware, using the fence_vmware_soap stonith agents. Pacemaker is at version 1.1.10-14, CMAN is at version 3.0.12.1-59. I used the following tutorial to assist me in setting up dlm/clvmd/gfs2 on CentOS 6.5 (if it helps in the debugging):

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/7-Beta/html/Global_File_System_2/ch-clustsetup-GFS2.html

Any assistance, tips, tricks, comments or criticisms are all welcome. I have attached my cluster.conf if required; some node name obfuscation has been done. If you need any additional info, please don't hesitate to ask.

Thanks
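For reference, the timeout being suspected here lives on the monitor operation of the dlm/clvmd clone resources. Adjusting it would look roughly like the sketch below (the timeout values are arbitrary illustrations, and as pointed out elsewhere in the thread a longer timeout alone is unlikely to help while clvmd is blocked behind DLM fencing):

pcs resource update dlm op monitor interval=30s timeout=90s on-fail=fence
pcs resource update clvmd op monitor interval=30s timeout=120s on-fail=fence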