Re: [Pacemaker] Colocation set options (pcs syntax)

2014-02-28 Thread Asgaroth


> pcs constraint colocation set fs_ldap-clone sftp01-vip ldap1
> sequential=true
>
> Let me know if this does or doesn't work for you.


I have been testing this for a couple of days now and I think I must be 
doing something wrong. Firstly, though, the command itself completes 
successfully:


# pcs constraint show --full

  Resource Sets:
    set fs_ldap-clone sftp01-vip ldap1 sequential=true (id:pcs_rsc_set) (id:pcs_rsc_colocation)


However, if I try to test it by moving, for example, the "sftp01-vip" 
resource group to another node, it does not move the ldap1 service 
with it. Example below:


Cluster state before resource move:
http://pastebin.com/a13ZhyRq

Then I do "pcs resource move sftp01-vip bfievsftp02", which moves the 
resources to that node (except the associated ldap1 service).


Cluster state after the move:
http://pastebin.com/BSyTBEhX

Full constraint list:
http://pastebin.com/ng6m4C1Z

Here is what I am trying to achieve:
[1] The sftp0[1-3]-vip groups each have a preferred node 
(sftp01-vip=node1, sftp02-vip=node2, sftp03-vip=node3)

[2] The sftp0[1-3] lsb resources are colocated with the sftp0[1-3]-vip groups
[3] The ldap[1-3] lsb resources are colocated with the sftp0[1-3]-vip groups

I managed to achieve the above using the constraints listed in the 
constraint output; however, the sftp0[1-3] and ldap[1-3] lsb resources 
also depend on fs_cdr-clone and fs_ldap-clone, respectively, being available.


I thought I would be able to achieve that file system dependency using 
the colocation set, but it does not seem to work the way I am 
expecting it to, or, quite possibly, my logic may be slightly (largely) 
off :)


How would I ensure that, in the case of a node failure, the vip group 
moves to a node which has the fs_cdr and fs_ldap file system resources 
available? If I can do that, then I can keep the colocation rule for 
the sftp/ldap service with the vip group. Or am I thinking about this 
the wrong way around?
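
For reference, the plain per-resource form of what I am trying to express 
would look roughly like this (just a sketch using the resource names above, 
not something I have tested in this exact form):

# colocate the vip group with the file system clones it depends on,
# and start the file systems before the vip group
pcs constraint colocation add sftp01-vip with fs_ldap-clone INFINITY
pcs constraint colocation add sftp01-vip with fs_cdr-clone INFINITY
pcs constraint order start fs_ldap-clone then start sftp01-vip
pcs constraint order start fs_cdr-clone then start sftp01-vip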


Any tips/suggestions would be appreciated.

Thanks


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Colocation set options (pcs syntax)

2014-02-25 Thread Asgaroth
> 
> I think there's an error in the man page (which I'll work on getting
> fixed).

Thanks Chris.

> Can you try: (removing 'setoptions' from your command)
> 
> 
> pcs constraint colocation set fs_ldap-clone sftp01-vip ldap1
> sequential=true
> 
> 
> Let me know if this does or doesn't work for you.
> 

I shall give this a go a little later today and get back to you.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Colocation set options (pcs syntax)

2014-02-24 Thread Asgaroth
Hi All,

I have several resources that depend on a cloned shared file system and a vip
that need to be up and operational before the resource can start. I was
reading the Pacemaker documentation and it looks like colocation sets are
what I am after. I can see in the documentation that you can define a
colocation set and set the sequential option to "true" if you need the
resources to start sequentially; I guess this then becomes an ordered
colocation set, which is what I am after. The documentation I was reading is
here:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-sets-collocation.html

According to the pcs man page, I can set up a colocation set as follows:

colocation set <resource1> <resource2> [resourceN]... [options] ...
   [set <resource1> <resource2> ...] [setoptions <name>=<value>...]

However when I run the following command to create the set:

pcs constraint colocation set fs_ldap-clone sftp01-vip ldap1 setoptions
sequential=true

I get an error stating:

Error: Unable to update cib
Call cib_replace failed (-203): Update does not conform to the configured
schema

And then a dump of the current running info base.

Am I reading the man page incorrectly, or is this a bug I need to report?
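
(The form that eventually worked in this thread, as shown in the replies
above, simply drops the "setoptions" keyword:)

# workaround from the replies above: omit 'setoptions'
pcs constraint colocation set fs_ldap-clone sftp01-vip ldap1 sequential=true

# then confirm what was written to the CIB
pcs constraint show --full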

Thanks


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-24 Thread Asgaroth
Just an update on this issue which has now been resolved.

 

The issue was with my cluster configuration: dlm and sctp do not play nicely
with each other. I had to un-configure the redundant rings and set rrp_mode
to "none", after which clvmd works as expected.

 

Thanks to all for your assistance in this issue.

 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-20 Thread Asgaroth
> 
> I would really love to see logs at this point.
> Both from pacemaker and the system in general (and clvmd if it produces
> any).
> 
> Based on what you say below, there doesn't seem to be a good reason for
> the hang (ie. no reason to be trying to fence anyone)
> 

I will try to get some logs to you and Fabio today; I just want to enable
debug logging for the cluster as Fabio suggested, and I will re-enable debug
logging for clvmd (as suggested by Vladislav earlier in the thread).
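
For reference, the clvmd debug options Vladislav described, as I understand
them (a sketch, using the cman cluster interface I am running with):

# stop the running instance and run it in the foreground with debug output
service clvmd stop
clvmd -f -d1 -I cman

# or ask the already-running instances to start logging debug to syslog
clvmd -C -d2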

> 
> Right. I forgot. Sorry. Carry on :-)
> There have been a bunch of discussions going on regarding clvmd in rhel7
> and they got muddled in my head.
> 

No worries sir, it happens to the best of us :)



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-18 Thread Asgaroth
> 
> Just a guess. Do you have startup fencing enabled in dlm-controld (I actually
> do not remember if it is applicable to cman's version, but it exists in 
> dlm-4) or
> cman?
> If yes, then that may play its evil game, because imho it is not intended to
> use with pacemaker which has its own startup fencing policy (if you redirect
> fencing to pacemaker).
> 

I can't seem to find the option to enable/disable startup fencing in either
dlm_controld or cman.

"dlm_controld -h" doesn't list an option to enable/disable startup fencing,
and I had a quick read of the cman man page and also don't see any option
mentioning startup fencing.

Would you mind pointing me in the direction of the parameter to disable this
in cman/dlm_controld, please?

PS: I am redirecting all fencing operations to pacemaker using the following 
directive:


Thanks


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-18 Thread Asgaroth
> i sometimes have the same situation. sleep ~30 seconds between startup
> cman and clvmd helps a lot.
> 

Thanks for the tip. I just tried this (added a sleep 30 in the start section
of the case statement in the cman init script), but it did not resolve the
issue for me. For some reason clvmd just refuses to start, and I don't see
many debug errors showing up, so I cannot say for sure what clvmd is trying
to do :(


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-18 Thread Asgaroth
> 
> The 3rd node should (and needs to be) fenced at this point to allow the
> cluster to continue.
> Is this not happening?

The fencing operation appears to complete successfully, here is the
sequence:

[1] All 3 nodes running properly
[2] On node 3 I run "echo c > /proc/sysrq-trigger", which "hangs" node3
[3] The fence_test03 resource executes a fence operation on node 3 (fires a
shutdown/startup on the vm)
[4] dlm shows the kern_stop state while node 3 is being fenced
[5] Node 3 reboots, and nodes 1 & 2 operate as normal (clvmd and gfs2 work
properly, dlm is notified that the fence was successful (2 members in each
lock group))
[6] While node 3 is booting, cman starts properly, then clvmd starts but
hangs on boot
[7] While node 3 is "hung" at the clvmd stage, nodes 1 & 2 are unable to
perform lvm operations because node 3 is attempting to join the clvmd
"group". dlm shows that node 3 is a member and cman sees node 3 as a cluster
member; however, pacemaker has not started because clvmd has not
successfully started.

Because pacemaker is not "up", and because I do not have clvmd as a resource
definition, no fence is performed if/when clvmd fails.

Other than the above, fencing appears to be working properly. Are there some
other fencing tests you would like me to perform to verify that fencing is
working as expected?
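
For example, I could run manual fence tests along these lines (a sketch):

# ask pacemaker/stonithd to fence a node directly
pcs stonith fence test03

# or go through the lower-level stonith_admin interface
stonith_admin --reboot test03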

> 
> Did you specify on-fail=fence for the clvmd agent?
> 


Hmmm, I don't have any clvmd agents defined within pacemaker at the moment,
as I am starting clvmd outside of pacemaker control.

In my original post I had clvmd and dlm defined as clone resources under
pacemaker control. My understanding from the responses to that post was to
remove those resources from pacemaker control, run clvmd on boot, and let
dlm be managed by cman startup. Are you saying that I should have dlm/clvmd
defined as pacemaker resources and still have clvmd start on bootup?

For example, originally I defined dlm/clvmd under pacemaker control as
follows:

pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s
on-fail=fence clone interleave=true ordered=true
pcs resource create clvmd lsb:clvmd op monitor interval=30s on-fail=fence
clone interleave=true ordered=true
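
(Those clones were tied together with the usual ordering/colocation
constraints, which are still visible in the pcs config dump later in this
thread:)

pcs constraint order start dlm-clone then start clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone INFINITY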

However, right now, the above two resource definitions have been removed
from pacemaker.

Thanks for your time (and everyone else's too) thus far in assisting me with
this issue.

Thanks


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-17 Thread Asgaroth
> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: 17 February 2014 00:55
> To: li...@blueface.com; The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced
> 
> 
> If you have configured cman to use fence_pcmk, then all cman/dlm/clvmd
> fencing operations are sent to Pacemaker.
> If you aren't running pacemaker, then you have a big problem as no-one can
> perform fencing.

I have configured pacemaker as the resource manager and I have it enabled to
start on boot-up too as follows:

chkconfig cman on
chkconfig clvmd on
chkconfig pacemaker on

> 
> I don't know if you are testing without pacemaker running, but if so you
> would need to configure cman with real fencing devices.
>

I have been testing with pacemaker running and the fencing appears to be
operating fine, the issue I seem to have is that clvmd is unable re-acquire
its locks when attempting to rejoin the cluster after a fence operation, so
it looks like clvmd just hangs when the startup script fires it off on
boot-up. When the 3rd node is in this state (hung clvmd), then the other 2
nodes are unable to obtain locks from the third node as clvmd has hung, as
an example, this is what happens when the 3rd node is hung at the clvmd
startup phase after pacemaker has issued a fence operation (running pvs on
node1)

[root@test01 ~]# pvs
  Error locking on node test03: Command timed out
  Unable to obtain global lock.
 
The dlm elements look fine to me here too:

[root@test01 ~]# dlm_tool ls
dlm lockspaces
name          cdr
id            0xa8054052
flags         0x0008 fs_reg
change        member 2 joined 0 remove 1 failed 1 seq 2,2
members       1 2

name          clvmd
id            0x4104eefa
flags         0x
change        member 3 joined 1 remove 0 failed 0 seq 3,3
members       1 2 3

So it looks like cman/dlm are operating properly; however, clvmd hangs and
never exits, so pacemaker never starts on the 3rd node. The 3rd node is
therefore in the "pending" state while clvmd is hung:

[root@test02 ~]# crm_mon -Afr -1
Last updated: Mon Feb 17 15:52:28 2014
Last change: Mon Feb 17 15:43:16 2014 via cibadmin on test01
Stack: cman
Current DC: test02 - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
3 Nodes configured
15 Resources configured


Node test03: pending
Online: [ test01 test02 ]

Full list of resources:

 fence_test01  (stonith:fence_vmware_soap):Started test01 
 fence_test02  (stonith:fence_vmware_soap):Started test02 
 fence_test03  (stonith:fence_vmware_soap):Started test01 
 Clone Set: fs_cdr-clone [fs_cdr]
 Started: [ test01 test02 ]
 Stopped: [ test03 ]
 Resource Group: sftp01-vip
 vip-001(ocf::heartbeat:IPaddr2):   Started test01 
 vip-002(ocf::heartbeat:IPaddr2):   Started test01 
 Resource Group: sftp02-vip
 vip-003(ocf::heartbeat:IPaddr2):   Started test02 
 vip-004(ocf::heartbeat:IPaddr2):   Started test02 
 Resource Group: sftp03-vip
 vip-005(ocf::heartbeat:IPaddr2):   Started test02 
 vip-006(ocf::heartbeat:IPaddr2):   Started test02 
 sftp01 (lsb:sftp01):   Started test01 
 sftp02 (lsb:sftp02):   Started test02 
 sftp03 (lsb:sftp03):   Started test02 

Node Attributes:
* Node test01:
* Node test02:
* Node test03:

Migration summary:
* Node test03: 
* Node test02: 
* Node test01:


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-11 Thread Asgaroth
> -Original Message-
> From: Vladislav Bogdanov [mailto:bub...@hoster-ok.com]
> Sent: 11 February 2014 03:44
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced
> 
> Nope, it's CentOS 6. In a few words, it is probably safer for you to stay
> with cman, especially if you need GFS2. gfs_controld is not officially
> ported to corosync2 and is obsolete in EL7 because communication between
> gfs2 and dlm is moved to kernelspace there.
> 

OK, thanks. I may do some searching on how to compile corosync2 on CentOS 6
for a different cluster I need to set up that does not have the gfs2
requirement. Thanks for the info.

> 
> You need to fix that for sure.
> 

I ended up rebuilding all my nodes and adding a third one to see if quorum
may have been the issue, but the symptoms are still the same. I ended up
stracing clvmd, and it looks like it tries to write to /dev/misc/dlm_clvmd,
which doesn't exist on the "failed" node.
I attached the trace to an existing bug listed in the CentOS bug tracker:
http://bugs.centos.org/view.php?id=6853
This looks like something to do with clvmd and its locks, but dlm appears to
be operating fine for me; I don't see any kern_stop flags for clvmd at all
when the node is being fenced. It is a strange one, because if I shut down
and reboot any of the nodes cleanly then everything comes back up OK;
however, when I simulate a failure, that is when the issue comes in.

> 
> Strange message, looks like something is bound to that port already.
> You may want to try dlm in tcp mode btw.
> 

I was unable to run dlm in tcp mode as I have dual-homed interfaces, so tcp
mode won't work in this case :) Thanks for the recommendation though.
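
(For a single-ring cluster, forcing dlm onto tcp would be roughly the
following, in the same ccs style used for the rest of the cluster.conf setup;
a sketch only, and not applicable here because tcp cannot bind the multiple
addresses used by redundant rings:)

ccs -f /etc/cluster/cluster.conf --setdlm protocol=tcp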



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-10 Thread Asgaroth


-Original Message-
From: Vladislav Bogdanov [mailto:bub...@hoster-ok.com] 
Sent: 10 February 2014 13:27
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced


I cannot really recall if it hangs or returns error for that (I moved to
corosync2 long ago).

Are you running corosync2 on RHEL7 beta? Are we able to run corosync2 on
CentOS 6/RHEL 6?

Anyways you probably want to run clvmd with debugging enabled.
iirc you have two choices here, either you'd need to stop running instance
first and then run it in the console with -f -d1, or run clvmd -C -d2 to ask
all running instances to start debug logging to syslog.
I prefer first one, because modern syslogs do rate-limiting.
And, you'd need to run lvm commands with debugging enabled too.

Thanks for this tip. I have modified clvmd to run in debug mode ("clvmd -T60
-d 2 -I cman") and I notice that on the node2 reboot I don't see any logs of
clvmd actually attempting to start, so it appears there is something wrong
here with clvmd. However, I did try to manually stop/start clvmd on node2
after a reboot, and these were the error logs reported:

Feb 10 12:37:08 test02 kernel: dlm: connecting to 1 sctp association 2
Feb 10 12:38:00 test02 kernel: dlm: Using SCTP for communications
Feb 10 12:38:00 test02 clvmd[2118]: Unable to create DLM lockspace for CLVM:
Address already in use
Feb 10 12:38:00 test02 kernel: dlm: Can't bind to port 21064 addr number 1
Feb 10 12:38:00 test02 kernel: dlm: cannot start dlm lowcomms -98
Feb 10 12:39:37 test02 kernel: dlm: Using SCTP for communications
Feb 10 12:39:37 test02 clvmd[2137]: Unable to create DLM lockspace for CLVM:
Address already in use
Feb 10 12:39:37 test02 kernel: dlm: Can't bind to port 21064 addr number 1
Feb 10 12:39:37 test02 kernel: dlm: cannot start dlm lowcomms -98
Feb 10 12:47:21 test02 clvmd[2159]: Unable to create DLM lockspace for CLVM:
Address already in use
Feb 10 12:47:21 test02 kernel: dlm: Using SCTP for communications
Feb 10 12:47:21 test02 kernel: dlm: Can't bind to port 21064 addr number 1
Feb 10 12:47:21 test02 kernel: dlm: cannot start dlm lowcomms -98
Feb 10 12:48:14 test02 kernel: dlm: closing connection to node 2
Feb 10 12:48:14 test02 kernel: dlm: closing connection to node 1

So it appears that the issue is with clvmd attempting to communicate with, I
presume, dlm. I tried to do some searching on this error, and it appears
there is a bug report, if I recall correctly, from around 2004, which was
fixed, so I cannot see why this error is cropping up. Some other strangeness:
if I reboot the node a couple of times, it may start up properly on the 2nd
node and then things appear to work properly; however, while node 2 is
"down", clvmd on node1 is still in a "hung" state even though dlm appears to
think everything is good. Have you come across this issue before?
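
If it helps, I can also check what is still holding the dlm port along these
lines (a sketch, assuming the sctp module exposes its usual /proc entries):

# see whether an old dlm sctp endpoint/association is still registered
cat /proc/net/sctp/eps
cat /proc/net/sctp/assocs

# and dlm's own view of recent events
dlm_tool dump | tail -50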

Thanks for your assistance thus far, I appreciate it.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-10 Thread Asgaroth
Hi All,

 

OK, here is my testing with cman/clvmd enabled on system startup and clvmd
outside of pacemaker control. I still seem to be getting the clvmd hang/fail
situation even when running outside of pacemaker control. I cannot see
off-hand where the issue is occurring, but maybe it is related to what
Vladislav was saying, where clvmd hangs if it is not running on a cluster
node that has cman running; however, I have both cman and clvmd enabled to
start at boot. Here is a little synopsis of what appears to be happening:

 

[1] Everything is fine here, both nodes up and running:

 

# cman_tool nodes

Node  Sts   Inc   Joined               Name
   1   M    444   2014-02-07 10:25:00  test01
   2   M    440   2014-02-07 10:25:00  test02

# dlm_tool ls

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x
change        member 2 joined 1 remove 0 failed 0 seq 1,1
members       1 2

 

[2] Here I run "echo c > /proc/sysrq-trigger" on node2 (test02). I can see
crm_mon saying that node 2 is in an unclean state, and fencing kicks in
(reboots node 2):

 

# cman_tool nodes

Node  Sts   Inc   Joined               Name
   1   M    440   2014-02-07 10:27:58  test01
   2   X    444                        test02

# dlm_tool ls

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x0004 kern_stop
change        member 2 joined 1 remove 0 failed 0 seq 2,2
members       1 2
new change    member 1 joined 0 remove 1 failed 1 seq 3,3
new status    wait_messages 0 wait_condition 1 fencing
new members   1

 

[3] So the above looks fine so far, to my untrained eye: dlm is in the
kern_stop state while waiting on a successful fence. The node then reboots
and we have the following state:

 

# cman_tool nodes

Node  Sts   Inc   Joined               Name
   1   M    440   2014-02-07 10:27:58  test01
   2   M    456   2014-02-07 10:35:42  test02

# dlm_tool ls

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x
change        member 2 joined 1 remove 0 failed 0 seq 4,4
members       1 2

 

So it looks like dlm and cman seem to be working properly (again, I could be
wrong, my untrained eye and all :) )

 

However, if I try to run any lvm or clvm status commands, they still just
hang. Could this be related to clvmd doing a check when cman is up and
running but clvmd has not started yet (as I understand from Vladislav's
previous email)? Or do I have something fundamentally wrong with my fencing
configuration?

 

Here is a link to the "dlm_tool dump" at the time of the above "dlm_tool ls"
(if it helps)

http://pastebin.com/KV6YZWrN

 

Again, thanks for all the info thus far.

 

Thanks

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-07 Thread Asgaroth

On 06/02/2014 04:30, Nikita Staroverov wrote:
Why do you need clvmd as a cluster resource? If you start clvmd 
outside of a cluster your problem will be no problem at all.


I was running it under pacemaker because it is a neat way of seeing 
dependent services. When I remove dlm/clvmd from pacemaker control, I 
cannot immediately see that the shared file system has a dependency on 
clvmd and dlm. I guess this is just a personal preference.


However, I was testing dlm/clvmd outside of pacemaker control yesterday 
and my issue still persists, so I am wondering if there is something 
else amiss that I have not uncovered yet. I'm busy gathering logs for a 
reply, so I will get back to it a little later today.



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-07 Thread Asgaroth

On 06/02/2014 05:52, Vladislav Bogdanov wrote:

Hi,

I bet your problem comes from the LSB clvmd init script.
Here is what it does do:

===
...
clustered_vgs() {
 ${lvm_vgdisplay} 2>/dev/null | \
 awk 'BEGIN {RS="VG Name"} {if (/Clustered/) print $1;}'
}

clustered_active_lvs() {
 for i in $(clustered_vgs); do
 ${lvm_lvdisplay} $i 2>/dev/null | \
 awk 'BEGIN {RS="LV Name"} {if (/[^N^O^T] available/) print $1;}'
 done
}

rh_status() {
 status $DAEMON
}
...
case "$1" in
...
   status)
 rh_status
 rtrn=$?
 if [ $rtrn = 0 ]; then
 cvgs="$(clustered_vgs)"
 echo Clustered Volume Groups: ${cvgs:-"(none)"}
 clvs="$(clustered_active_lvs)"
 echo Active clustered Logical Volumes: ${clvs:-"(none)"}
 fi
...
esac

exit $rtrn
=

So, it not only looks for status of daemon itself, but also tries to
list volume groups. And this operation is blocked because fencing is
still in progress, and the whole cLVM thing (as well as DLM itself and
all other dependent services) is frozen. So your resource timeouts in
monitor operation, and then pacemaker asks it to stop (unless you have
on-fail=fence). Anyways, there is a big chance that stop will fail too,
and that leads again to fencing. cLVM is very fragile in my opinion
(although newer versions running on corosync2 stack seem to be much
better). And it is probably still doesn't work well when managed by
pacemaker in CMAN-based clusters, because it blocks globally if any node
in the whole cluster is online at the cman layer but doesn't run clvmd
(I checked last time with .99). And that was the same for all stacks,
until was fixed for corosync (only 2?) stack recently. The problem with
that is that you cannot just stop pacemaker on one node (f.e. for
maintenance), you should immediately stop cman as well (or run clvmd in
cman'ish way) - cLVM freezes on another node. This should be easily
fixable in clvmd code, but nobody cares.


Thanks for the explanation. This is interesting for me, as I need a 
volume manager in the cluster to manage the shared file systems in case 
I need to resize for some reason. I think I may be coming up against 
something similar now that I am testing cman outside of the cluster: 
even though I have cman/clvmd enabled outside pacemaker, the clvmd 
daemon still hangs even when the 2nd node has been rebooted due to a 
fence operation. When it (node 2) reboots, cman and clvmd start, and I 
can see both nodes as members using cman_tool, but clvmd still seems to 
have an issue, it just hangs. I can't see off-hand whether dlm still 
thinks pacemaker is in the fence operation (or if it has already 
returned true for a successful fence). I am still gathering logs and 
will post back to this thread once I have all my logs from yesterday 
and this morning.


I don't suppose anyone knows of another volume manager that is 
cluster-aware?




Increasing the timeout for the LSB clvmd resource probably won't help you,
because LVM operations that are blocked (because DLM is waiting for
fencing) iirc never finish.

You may want to search for the clvmd OCF resource agent; it is available
for SUSE, I think. Although it is not perfect, it should work much better
for you


I will have a look around for this clvmd OCF agent and see what is 
involved in getting it to work on CentOS 6.5 if I don't have any success 
with the current recommendation of running it outside of pacemaker control.
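
If I do find it, I would expect the definition to end up looking roughly like
this (a sketch only; I am assuming the SUSE agent installs as provider
"lvm2", type "clvmd", which I have not verified on CentOS):

# hypothetical: replace the lsb:clvmd clone with the OCF agent
pcs resource create clvmd ocf:lvm2:clvmd op monitor interval=30s \
    on-fail=fence clone interleave=true ordered=true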



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-07 Thread Asgaroth

On 05/02/2014 18:57, Никита Староверов wrote:

It seems to me clvmd can't answer the pacemaker monitor operation within
30 sec because it is also locked by dlm.
You don't need clvmd and dlm resources on cman-based clusters. clvmd
can simply start after cman; both dlm and fenced are configured by
cman.


Thanks for the tip. I was testing cman/clvmd outside of cluster resource 
management yesterday and have come across another issue. I will reply 
to this thread once I have gathered up all the logs.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-05 Thread Asgaroth

On 05/02/2014 16:12, Digimer wrote:
You say it's working now? If so, excellent. If you have any troubles 
though, please share your cluster.conf and 'pcs config show'.




Hi Digimer, no, it's not working as I expect: when I test a crash of 
node 2, clvmd goes into a failed state and then node1 gets "shot in the 
head". Other than that, the config appears to work fine with the minimal 
testing I have done so far :)


I have attached the cluster.conf and pcs config files to the email (with 
minimal obfuscation).


Thanks

(cluster.conf attachment: XML content was not preserved in this archive)
[root@test01 ~]# pcs config show
Cluster Name: sftp-cluster
Corosync Nodes:
 
Pacemaker Nodes:
 test01 test02 

Resources: 
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true 
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor on-fail=fence interval=30s (dlm-monitor-interval-30s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true 
  Resource: clvmd (class=lsb type=clvmd)
   Operations: monitor on-fail=fence interval=30s (clvmd-monitor-interval-30s)
 Clone: fs-cdr-clone
  Meta Attrs: interleave=true 
  Resource: fs-cdr (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/appvg/cdrlv directory=/shared/cdr fstype=gfs2 
options=defaults,noatime,nodiratime 
   Operations: monitor on-fail=fence interval=10s (fs-cdr-monitor-interval-10s)
 Group: sftp01-vip
  Resource: vip-001 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.6.0.16 cidr_netmask=24 nic=eth0 
   Operations: monitor interval=5s (vip-001-monitor-interval-5s)
  Resource: vip-002 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.7.0.16 cidr_netmask=24 nic=eth1 
   Operations: monitor interval=5s (vip-002-monitor-interval-5s)
 Group: sftp02-vip
  Resource: vip-003 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.6.0.17 cidr_netmask=24 nic=eth0 
   Operations: monitor interval=5s (vip-003-monitor-interval-5s)
  Resource: vip-004 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.7.0.17 cidr_netmask=24 nic=eth1 
   Operations: monitor interval=5s (vip-004-monitor-interval-5s)
 Resource: sftp01 (class=lsb type=sftp01)
  Operations: monitor interval=30s (sftp01-monitor-interval-30s)
 Resource: sftp02 (class=lsb type=sftp02)
  Operations: monitor interval=30s (sftp02-monitor-interval-30s)

Stonith Devices: 
 Resource: fence_test01 (class=stonith type=fence_vmware_soap)
  Attributes: login=user passwd=password action=reboot ipaddr=vcenter_host 
port=TEST01 ssl=1 pcmk_host_list=test01 
  Operations: monitor interval=60s (fence_test01-monitor-interval-60s)
 Resource: fence_test02 (class=stonith type=fence_vmware_soap)
  Attributes: login=user passwd=password action=reboot ipaddr=vcenter_host 
port=TEST02 ssl=1 pcmk_host_list=test02 
  Operations: monitor interval=60s (fence_test02-monitor-interval-60s)
Fencing Levels: 

Location Constraints:
  Resource: sftp01
Enabled on: test01 (score:INFINITY) (role: Started) (id:cli-prefer-sftp01)
  Resource: sftp01-vip
Enabled on: test01 (score:100) (id:location-sftp01-vip-test01-100)
  Resource: sftp02
Enabled on: test02 (score:INFINITY) (role: Started) (id:cli-prefer-sftp02)
  Resource: sftp02-vip
Enabled on: test02 (score:100) (id:location-sftp02-vip-test02-100)
Ordering Constraints:
  start dlm-clone then start clvmd-clone (Mandatory) 
(id:order-dlm-clone-clvmd-clone-mandatory)
  start clvmd-clone then start fs-cdr-clone (Mandatory) 
(id:order-clvmd-clone-fs-cdr-clone-mandatory)
  start sftp01-vip then start sftp01 (Mandatory) 
(id:order-sftp01-vip-sftp01-mandatory)
  start sftp02-vip then start sftp02 (Mandatory) 
(id:order-sftp02-vip-sftp02-mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (INFINITY) 
(id:colocation-clvmd-clone-dlm-clone-INFINITY)
  fs-cdr-clone with clvmd-clone (INFINITY) 
(id:colocation-fs-cdr-clone-clvmd-clone-INFINITY)
  sftp01 with sftp01-vip (INFINITY) (id:colocation-sftp01-sftp01-vip-INFINITY)
  sftp02 with sftp02-vip (INFINITY) (id:colocation-sftp02-sftp02-vip-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6_5.2-368c726
 last-lrm-refresh: 1391176104
 no-quorum-policy: ignore
 stonith-enabled: true

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] node1 fencing itself after node2 being fenced

2014-02-05 Thread Asgaroth

On 05/02/2014 13:44, Nikita Staroverov wrote:
Your setup is completely wrong, sorry. You must use the RHEL6 
documentation, not RHEL7.
In short, you should create a cman cluster according to the RHEL6 docs, but 
use pacemaker instead of rgmanager and fence_pcmk as the fence agent for cman.


Thanks for the info. However, I am already using cman for cluster 
management and pacemaker as the resource manager. This is how I created 
the cluster, and it appears to be working OK; please let me know if this 
is not the correct method for CentOS/RHEL 6.5:


---
ccs -f /etc/cluster/cluster.conf --createcluster sftp-cluster
ccs -f /etc/cluster/cluster.conf --addnode test01
ccs -f /etc/cluster/cluster.conf --addalt test01 test01-alt
ccs -f /etc/cluster/cluster.conf --addnode test02
ccs -f /etc/cluster/cluster.conf --addalt test02 test02-alt
ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect test01
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect test02
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk test01 \
    pcmk-redirect port=test01
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk test02 \
    pcmk-redirect port=test02
ccs -f /etc/cluster/cluster.conf --setcman \
    keyfile="/etc/corosync/authkey" transport="udpu" port="5405"

ccs -f /etc/cluster/cluster.conf --settotem rrp_mode="active"
sed -i.bak "s/.*CMAN_QUORUM_TIMEOUT=.*/CMAN_QUORUM_TIMEOUT=0/g" \
    /etc/sysconfig/cman


pcs stonith create fence_test01 fence_vmware_soap login="user" \
    passwd="password" action="reboot" ipaddr="vcenter_host" port="TEST01" \
    ssl="1" pcmk_host_list="test01" delay="15"
pcs stonith create fence_test02 fence_vmware_soap login="user" \
    passwd="password" action="reboot" ipaddr="vcenter_host" port="TEST02" \
    ssl="1" pcmk_host_list="test02"


pcs property set no-quorum-policy="ignore"
pcs property set stonith-enabled="true"
---

The above is taken directly from the Pacemaker RHEL 6 two-node cluster 
quick start guide (except for the fence agent definitions).


At this point the cluster comes up, cman_tool sees the two hosts as 
joined, and the cluster is communicating over the two rings defined. I 
couldn't find the equivalent "pcs" syntax to perform the above 
configuration; looking at the pcs man page, I couldn't track down how 
to, for example, set the security key file using pcs syntax.


The DLM/CLVMD/GFS2 configuration was taken from the RHEL7 documentation, 
as it illustrated how to set it up using pcs syntax. The configuration 
commands appear to work fine and the services appear to be configured 
correctly: pacemaker starts the services properly, and the cluster 
appears to work properly if I enable/disable the services using pcs 
syntax, manually stop/start the pacemaker service, or perform a clean 
shutdown/restart of the second node. The issue comes in when I test a 
crash of the second node, which is where I find the particular issue 
with fencing.


Reading some archives of this mailing list, there seem to be suggestions 
that dlm may be waiting on pacemaker to fence a node, which then causes a 
temporary "freeze" of the clvmd/gfs2 configuration; I understand this is 
by design. However, when I test a hang of the 2nd node by doing an 
"echo c > /proc/sysrq-trigger", I can see that stonithd begins fencing 
procedures for node2. At this point, according to crm_mon, the dlm 
service is stopped on node2 and started on node1, and clvmd then goes 
into a failed state, I presume because of a possible timeout (I could be 
wrong), or, potentially, because it cannot communicate with clvmd on 
node2. When clvmd goes into a failed state, this is when stonithd 
attempts to fence node1, and it does so successfully by shutting it down.


Some archive messages seem to suggest that clvmd should be started 
outside of the cluster at system boot (cman -> clvmd -> pacemaker); 
however, my personal preference would be to have these services managed 
by the cluster infrastructure, which is why I am attempting to set it up 
in this manner.


Is there anyone else out there running a similar configuration 
(dlm/clvmd/[gfs/gfs2/ocfs]) under pacemaker control?


Again, thanks for the info. I will do some more reading to ensure that I 
am using the correct pcs syntax to configure these services.


Thanks
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] node1 fencing itself after node2 being fenced

2014-02-05 Thread Asgaroth

Hi All,

First of all, thanks for the brilliant documentation at clusterlabs and 
the alteeva.ca tutorials! They helped me out a lot.


I am relatively new to pacemaker, but I come from a Solaris background 
with cluster experience and am now trying to get on board with pacemaker.


I have set up a 2-node cluster with a shared lun using pacemaker, cman, 
dlm, clvmd and gfs. I have configured 2 stonith devices, one to fence 
each node.


The issue I have is that when I test an unclean shutdown of the 2nd 
node, pacemaker goes ahead and fences the second node, but clvmd then 
goes into a failed state on node 1, and node 1 then fences itself (shuts 
down node 1).


I suspect it has something to do with me setting on-fail=fence for 
the dlm/clvmd services/RAs. DLM appears to be fine, but clvmd is the 
one that goes into a failed state. I suspect I have an issue with 
timeouts here, but, being new to pacemaker, I cannot see where; I am 
hoping a new pair of eyes can see where I am going wrong.


I am running, CentOS 6.5 in vmware, using the fence_vmware_soap stonith 
agents. Pacemaker is at version 1.1.10-14, CMAN is at version 3.0.12.1-59.


I used the following tutorial to assist me in setting up dlm/clvmd/gfs2 
on CentOS 6.5 (if it helps in the debugging):

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/7-Beta/html/Global_File_System_2/ch-clustsetup-GFS2.html 



Any assistance, tips, tricks, comments, criticisms are all welcome

I have attached my cluster.conf if required; some node name obfuscation 
has been done. If you need any additional info, please don't hesitate to 
ask.


Thanks

(cluster.conf attachment: XML content was not preserved in this archive)
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org