Re: [Pacemaker] Pacemaker with Xen 4.3 problem

2014-07-09 Thread Tobias Reineck
Hello,

do you mean the Xen script in /usr/lib/ocf/resource.d/heartbeat/?
I also tried replacing all "xm" calls with "xl", with no success.
Could you show me your RA resource configuration for Xen?

Best regards
T. Reineck



Date: Tue, 8 Jul 2014 22:27:59 +0200
From: alxg...@gmail.com
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Pacemaker with Xen 4.3 problem

IIRC the Xen RA uses 'xm'. However, fixing the RA is trivial and worked for me 
(if you're using the same RA).
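
For example, roughly along these lines (an untested sketch; the copied agent 
name and the sed pattern are assumptions, so review the agent before relying 
on it):

# work on a copy of the shipped agent and point it at xl instead of xm
cp /usr/lib/ocf/resource.d/heartbeat/Xen /usr/lib/ocf/resource.d/heartbeat/XenXL
sed -i 's/\bxm\b/xl/g' /usr/lib/ocf/resource.d/heartbeat/XenXL
# then configure the primitive as ocf:heartbeat:XenXL instead of ocf:heartbeat:Xen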
On 2014-07-08 21:39, Tobias Reineck tobias.rein...@hotmail.de wrote:




Hello,

I am trying to build a Xen HA cluster with Pacemaker/Corosync.
Xen 4.3 works on all nodes, and Xen live migration works fine.
Pacemaker also works with the cluster virtual IP.

But when I try to bring a Xen OCF Heartbeat resource online, an error
appears:

##
Failed actions:
xen_dns_ha_start_0 on xen01.domain.dom 'unknown error' (1): call=31, 
status=complete, last-rc-change='Sun Jul 6 15:02:25 2014', queued=0ms, 
exec=555ms

xen_dns_ha_start_0 on xen02.domain.dom 'unknown error' (1): call=10, 
status=complete, last-rc-change='Sun Jul 6 15:15:09 2014', queued=0ms, 
exec=706ms
##

I added the resource with the command


crm configure primitive xen_dns_ha ocf:heartbeat:Xen \
params xmfile=/root/xen_storage/dns_dhcp/dns_dhcp.xen \
op monitor interval=10s \
op start interval=0s timeout=30s \

op stop interval=0s timeout=300s

in the /var/log/messages the following error is printed:
2014-07-08T21:09:19.885239+02:00 xen01 lrmd[3443]:   notice: 
operation_finished: xen_dns_ha_stop_0:18214:stderr [ Error: Unable to connect 
to xend: No such file or directory. Is xend running? ]


I use Xen 4.3 with the xl toolstack, without xend.
Is it possible to use Pacemaker with Xen 4.3?
Can anybody please help me?

Best regards
T. Reineck


  

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker with Xen 4.3 problem

2014-07-09 Thread Alexandre
Actually I did it for the stonith resource agent external:xen0.
xm and xl are supposed to be semantically very close and as far as I can
see the ocf:heartbeat:Xen agent doesn't seem to use any xm command that
shouldn't work with xl.
What error do you have when using xl instead of xm?
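
For what it's worth, the equivalents can be checked by hand, e.g. (the domain 
name below is only a guess based on the config file mentioned earlier in the 
thread):

xl list                                             # replaces: xm list
xl create /root/xen_storage/dns_dhcp/dns_dhcp.xen   # replaces: xm create <cfg>
xl shutdown dns_dhcp                                # replaces: xm shutdown <domain>
xl migrate dns_dhcp <other-host>                    # replaces: xm migrate --live <domain> <host>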

Regards.


2014-07-09 8:39 GMT+02:00 Tobias Reineck tobias.rein...@hotmail.de:

 Hello,

 do you mean the Xen script in /usr/lib/ocf/resource.d/heartbeat/ ?
 I also tried this to replace all xm with xl with no success.
 Is it possible that you can show me you RA resource for Xen ?

 Best regards
 T. Reineck



 --
 Date: Tue, 8 Jul 2014 22:27:59 +0200
 From: alxg...@gmail.com
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Pacemaker with Xen 4.3 problem


 IIRC the Xen RA uses 'xm'. However, fixing the RA is trivial and worked
 for me (if you're using the same RA).
 On 2014-07-08 21:39, Tobias Reineck tobias.rein...@hotmail.de wrote:

 Hello,

 I try to build a XEN HA cluster with pacemaker/corosync.
 Xen 4.3 works on all nodes and also the xen live migration works fine.
 Pacemaker also works with the cluster virtual IP.
 But when I try to create a XEN OCF Heartbeat resource to get online, an
 error
 appears:

 ##
 Failed actions:
 xen_dns_ha_start_0 on xen01.domain.dom 'unknown error' (1): call=31,
 status=complete, last-rc-change='Sun Jul 6 15:02:25 2014', queued=0ms,
 exec=555ms
 xen_dns_ha_start_0 on xen02.domain.dom 'unknown error' (1): call=10,
 status=complete, last-rc-change='Sun Jul 6 15:15:09 2014', queued=0ms,
 exec=706ms
 ##

 I added the resource with the command

 crm configure primitive xen_dns_ha ocf:heartbeat:Xen \
 params xmfile=/root/xen_storage/dns_dhcp/dns_dhcp.xen \
 op monitor interval=10s \
 op start interval=0s timeout=30s \
 op stop interval=0s timeout=300s

 in the /var/log/messages the following error is printed:
 2014-07-08T21:09:19.885239+02:00 xen01 lrmd[3443]:   notice:
 operation_finished: xen_dns_ha_stop_0:18214:stderr [ Error: Unable to
 connect to xend: No such file or directory. Is xend running? ]

 I use xen 4.3 with XL toolstack without xend .
 Is it possible to use pacemaker with Xen 4.3 ?
 Can anybody please help me ?

 Best regards
 T. Reineck





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker with Xen 4.3 problem

2014-07-09 Thread Kristoffer Grönlund
On Wed, 9 Jul 2014 08:39:09 +0200
Tobias Reineck tobias.rein...@hotmail.de wrote:

 Hello,
 
 do you mean the Xen script in /usr/lib/ocf/resource.d/heartbeat/ ?
 I also tried this to replace all xm with xl with no success.
 Is it possible that you can show me you RA resource for Xen ?
 
 Best regards
 T. Reineck
 

I have been working on an updated Xen RA which supports both xl and xm. 

The pull request is here:

https://github.com/ClusterLabs/resource-agents/pull/440
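
The general idea (a simplified sketch only, not the actual code from the pull 
request) is to detect which toolstack is available and build the command from 
that:

# simplified sketch; see the pull request above for the real implementation
if command -v xl >/dev/null 2>&1 && xl info >/dev/null 2>&1; then
    xentool="xl"
else
    xentool="xm"
fi
$xentool list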

-- 
// Kristoffer Grönlund
// kgronl...@suse.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Help with config please

2014-07-09 Thread Alex Samad - Yieldbroker
Hi

Config pacemaker on centos 6.5
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64

this is my config
Cluster Name: ybrp
Corosync Nodes:
 
Pacemaker Nodes:
 devrp1 devrp2 

Resources: 
 Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 
clusterip_hash=sourceip-sourceport 
  Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s 
  Operations: monitor on-fail=restart interval=5s timeout=20s 
(ybrpip-monitor-interval-5s)
 Clone: ybrpstat-clone
  Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1 
  Resource: ybrpstat (class=ocf provider=yb type=proxy)
   Operations: monitor on-fail=restart interval=5s timeout=20s 
(ybrpstat-monitor-interval-5s)

Stonith Devices: 
Fencing Levels: 

Location Constraints:
Ordering Constraints:
  start ybrpstat-clone then start ybrpip (Mandatory) 
(id:order-ybrpstat-clone-ybrpip-mandatory)
Colocation Constraints:
  ybrpip with ybrpstat-clone (INFINITY) 
(id:colocation-ybrpip-ybrpstat-clone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6_5.3-368c726
 last-lrm-refresh: 1404892739
 no-quorum-policy: ignore
 stonith-enabled: false


I have my own resource agent file, and I start/stop the proxy service outside 
of Pacemaker.

I had an interesting problem: a VMware update on the Linux box interrupted 
network activity.

Part of the monitor function in my script is to 1) test whether the proxy 
process is running, and 2) fetch a status page from the proxy and confirm it 
returns 200.
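
Roughly like this (a sketch only, since the real agent is site-specific; the 
process name and status URL are placeholders, and the OCF_* return codes come 
from the sourced OCF shell functions):

proxy_monitor() {
    # 1) is the proxy process running at all?
    pidof proxy >/dev/null 2>&1 || return $OCF_NOT_RUNNING
    # 2) does the status page answer with HTTP 200?
    code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost/status)
    [ "$code" = "200" ] && return $OCF_SUCCESS
    return $OCF_ERR_GENERIC
}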


This is what I got in /var/log/messages

Jul  9 06:16:13 devrp1 crmd[6849]:  warning: update_failcount: Updating 
failcount for ybrpstat on devrp2 after failed monitor: rc=7 
(update=value++, time=1404850573)
Jul  9 06:16:13 devrp1 crmd[6849]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing 
failed op monitor for ybrpstat:0 on devrp2: not running (7)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing 
failed op start for ybrpstat:1 on devrp1: unknown error (1)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Restart 
ybrpip#011(Started devrp2)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Recover 
ybrpstat:0#011(Started devrp2)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: process_pe_message: Calculated 
Transition 1054: /var/lib/pacemaker/pengine/pe-input-235.bz2
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing 
failed op monitor for ybrpstat:0 on devrp2: not running (7)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing 
failed op start for ybrpstat:1 on devrp1: unknown error (1)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Restart 
ybrpip#011(Started devrp2)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Recover 
ybrpstat:0#011(Started devrp2)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: process_pe_message: Calculated 
Transition 1055: /var/lib/pacemaker/pengine/pe-input-236.bz2
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing 
failed op monitor for ybrpstat:0 on devrp2: not running (7)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: unpack_rsc_op: Processing 
failed op start for ybrpstat:1 on devrp1: unknown error (1)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul  9 06:16:13 devrp1 pengine[6848]:  warning: common_apply_stickiness: 
Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Restart 
ybrpip#011(Started devrp2)
Jul  9 06:16:13 devrp1 pengine[6848]:   notice: LogActions: Recover 
ybrpstat:0#011(Started devrp2)


And it stayed this way for the next 12 hours, until I got on.

I 

Re: [Pacemaker] Pacemaker with Xen 4.3 problem

2014-07-09 Thread Tobias Reineck
Hello,

here the log output
#
2014-07-09T10:49:01.315764+02:00 xen01 crmd[31294]:   notice: crm_add_logfile: 
Additional logging available in /var/log/pacemaker.log
2014-07-09T10:49:01.479820+02:00 xen01 crm_verify[31299]:   notice: 
crm_log_args: Invoked: crm_verify -V -p
2014-07-09T10:49:17.135725+02:00 xen01 crmd[31359]:   notice: crm_add_logfile: 
Additional logging available in /var/log/pacemaker.log
2014-07-09T10:49:32.683094+02:00 xen01 crmd[31367]:   notice: crm_add_logfile: 
Additional logging available in /var/log/pacemaker.log
2014-07-09T10:52:33.063416+02:00 xen01 crmd[31668]:   notice: crm_add_logfile: 
Additional logging available in /var/log/pacemaker.log
2014-07-09T10:52:33.224051+02:00 xen01 crm_verify[31673]:   notice: 
crm_log_args: Invoked: crm_verify -V -p
2014-07-09T10:52:33.378325+02:00 xen01 pengine[31686]:   notice: 
crm_add_logfile: Additional logging available in /var/log/pacemaker.log
2014-07-09T10:52:33.466427+02:00 xen01 crmd[3446]:   notice: 
do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ 
input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
2014-07-09T10:52:33.480118+02:00 xen01 pengine[3445]:   notice: unpack_config: 
On loss of CCM Quorum: Ignore
2014-07-09T10:52:33.480151+02:00 xen01 pengine[3445]:   notice: LogActions: 
Start   dnsdhcp#011(xen02.domain.dom)
2014-07-09T10:52:33.480161+02:00 xen01 pengine[3445]:   notice: 
process_pe_message: Calculated Transition 227: 
/var/lib/pacemaker/pengine/pe-input-240.bz2
2014-07-09T10:52:33.480431+02:00 xen01 crmd[3446]:   notice: te_rsc_command: 
Initiating action 7: monitor dnsdhcp_monitor_0 on xen02.domain.dom
2014-07-09T10:52:33.481059+02:00 xen01 crmd[3446]:   notice: te_rsc_command: 
Initiating action 5: monitor dnsdhcp_monitor_0 on xen01.domain.dom (local)
2014-07-09T10:52:33.586987+02:00 xen01 crmd[3446]:   notice: process_lrm_event: 
Operation dnsdhcp_monitor_0: not running (node=xen01.domain.dom, call=102, 
rc=7, cib-update=380, confirmed=true)
2014-07-09T10:52:33.611876+02:00 xen01 crmd[3446]:   notice: te_rsc_command: 
Initiating action 4: probe_complete probe_complete-xen01.domain.dom on 
xen01.domain.dom (local) - no waiting
2014-07-09T10:52:33.810913+02:00 xen01 crmd[3446]:   notice: te_rsc_command: 
Initiating action 6: probe_complete probe_complete-xen02.domain.dom on 
xen02.domain.dom - no waiting
2014-07-09T10:52:33.813788+02:00 xen01 crmd[3446]:   notice: te_rsc_command: 
Initiating action 10: start dnsdhcp_start_0 on xen02.domain.dom
2014-07-09T10:52:33.975340+02:00 xen01 crmd[3446]:  warning: status_from_rc: 
Action 10 (dnsdhcp_start_0) on xen02.domain.dom failed (target: 0 vs. rc: 1): 
Error
2014-07-09T10:52:33.975412+02:00 xen01 crmd[3446]:  warning: update_failcount: 
Updating failcount for dnsdhcp on xen02.domain.dom after failed start: rc=1 
(update=INFINITY, time=1404895953)
2014-07-09T10:52:33.979271+02:00 xen01 crmd[3446]:   notice: 
abort_transition_graph: Transition aborted by dnsdhcp_start_0 'modify' on 
xen02.domain.dom: Event failed 
(magic=0:1;10:227:0:37f37c0c-b063-4225-a380-a41137f7d460, cib=0.94.3, 
source=match_graph_event:344, 0)
2014-07-09T10:52:33.984242+02:00 xen01 crmd[3446]:  warning: update_failcount: 
Updating failcount for dnsdhcp on xen02.domain.dom after failed start: rc=1 
(update=INFINITY, time=1404895953)
2014-07-09T10:52:33.985790+02:00 xen01 crmd[3446]:  warning: status_from_rc: 
Action 10 (dnsdhcp_start_0) on xen02.domain.dom failed (target: 0 vs. rc: 1): 
Error
2014-07-09T10:52:33.987069+02:00 xen01 crmd[3446]:  warning: update_failcount: 
Updating failcount for dnsdhcp on xen02.domain.dom after failed start: rc=1 
(update=INFINITY, time=1404895953)
2014-07-09T10:52:33.988034+02:00 xen01 crmd[3446]:  warning: update_failcount: 
Updating failcount for dnsdhcp on xen02.domain.dom after failed start: rc=1 
(update=INFINITY, time=1404895953)
2014-07-09T10:52:33.988729+02:00 xen01 crmd[3446]:   notice: run_graph: 
Transition 227 (Complete=6, Pending=0, Fired=0, Skipped=1, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-240.bz2): Stopped
2014-07-09T10:52:33.989334+02:00 xen01 pengine[3445]:   notice: unpack_config: 
On loss of CCM Quorum: Ignore
2014-07-09T10:52:33.990014+02:00 xen01 pengine[3445]:  warning: 
unpack_rsc_op_failure: Processing failed op start for dnsdhcp on 
xen02.domain.dom: unknown error (1)
2014-07-09T10:52:33.990615+02:00 xen01 pengine[3445]:  warning: 
unpack_rsc_op_failure: Processing failed op start for dnsdhcp on 
xen02.domain.dom: unknown error (1)
2014-07-09T10:52:33.991355+02:00 xen01 pengine[3445]:   notice: LogActions: 
Recover dnsdhcp#011(Started xen02.domain.dom)
2014-07-09T10:52:33.992005+02:00 xen01 pengine[3445]:   notice: 
process_pe_message: Calculated Transition 228: 
/var/lib/pacemaker/pengine/pe-input-241.bz2
2014-07-09T10:52:34.040477+02:00 xen01 pengine[3445]:   notice: 

[Pacemaker] strange error

2014-07-09 Thread divinesecret

Hi,


just wanted to ask maybe someone encountered such situation.
suddenly cluster fails:

Jul  9 04:17:58 sdcsispprxfe1 IPaddr2(extVip51)[17292]: ERROR: Unknown 
interface [eth1] No such device.
Jul  9 04:17:58 sdcsispprxfe1 IPaddr2(extVip51)[17292]: ERROR: [findif] 
failed
Jul  9 04:17:58 sdcsispprxfe1 crmd[2116]:   notice: process_lrm_event: 
LRM operation extVip51_monitor_2 (call=57, rc=6, cib-update=2151, 
confirmed=false) not configured
Jul  9 04:17:58 sdcsispprxfe1 crmd[2116]:  warning: update_failcount: 
Updating failcount for extVip51 on sdcsispprxfe1 after failed monitor: 
rc=6 (update=value++, time=1404868678)
Jul  9 04:17:58 sdcsispprxfe1 crmd[2116]:   notice: 
do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ 
input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul  9 04:17:58 sdcsispprxfe1 attrd[2114]:   notice: 
attrd_trigger_update: Sending flush op to all hosts for: 
fail-count-extVip51 (1)
Jul  9 04:17:58 sdcsispprxfe1 pengine[2115]:   notice: unpack_config: 
On loss of CCM Quorum: Ignore
Jul  9 04:17:58 sdcsispprxfe1 attrd[2114]:   notice: 
attrd_perform_update: Sent update 42: fail-count-extVip51=1
Jul  9 04:17:58 sdcsispprxfe1 attrd[2114]:   notice: 
attrd_trigger_update: Sending flush op to all hosts for: 
last-failure-extVip51 (1404868678)
Jul  9 04:17:58 sdcsispprxfe1 pengine[2115]:error: unpack_rsc_op: 
Preventing extVip51 from re-starting anywhere in the cluster : operation 
monitor failed 'not configured' (rc=6)
Jul  9 04:17:58 sdcsispprxfe1 pengine[2115]:  warning: unpack_rsc_op: 
Processing failed op monitor for extVip51 on sdcsispprxfe1: not 
configured (6)


restart was issued and then:

IPaddr2(extVip51)[23854]: INFO: Bringing device eth1 up




Version: 1.1.10-14.el6_5.3-368c726
centos 6.5


(other logs don't show eth1 going down or sthing similar)





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] pacemaker stonith No such device

2014-07-09 Thread Dvorak Andreas
Dear all,

unfortunately STONITH does not work on my Pacemaker cluster. If I do ifdown 
on the two cluster interconnect interfaces of server sv2827, server sv2828 
wants to fence sv2827, but the messages log says: error: 
remote_op_done: Operation reboot of sv2827-p1 by sv2828-p1 for 
crmd.7979@sv2828-p1.076062f0: No such device
Can somebody please help me?


Jul  9 12:42:49 sv2828 corosync[7749]:   [CMAN  ] quorum lost, blocking activity
Jul  9 12:42:49 sv2828 corosync[7749]:   [QUORUM] This node is within the 
non-primary component and will NOT provide any services.
Jul  9 12:42:49 sv2828 corosync[7749]:   [QUORUM] Members[1]: 1
Jul  9 12:42:49 sv2828 corosync[7749]:   [TOTEM ] A processor joined or left 
the membership and a new membership was formed.
Jul  9 12:42:49 sv2828 crmd[7979]:   notice: cman_event_callback: Membership 
1492: quorum lost
Jul  9 12:42:49 sv2828 crmd[7979]:   notice: crm_update_peer_state: 
cman_event_callback: Node sv2827-p1[2] - state is now lost (was member)
Jul  9 12:42:49 sv2828 crmd[7979]:  warning: match_down_event: No match for 
shutdown action on sv2827-p1
Jul  9 12:42:49 sv2828 crmd[7979]:   notice: peer_update_callback: 
Stonith/shutdown of sv2827-p1 not matched
Jul  9 12:42:49 sv2828 kernel: dlm: closing connection to node 2
Jul  9 12:42:49 sv2828 corosync[7749]:   [CPG   ] chosen downlist: sender r(0) 
ip(192.168.2.28) r(1) ip(192.168.3.28) ; members(old:2 left:1)
Jul  9 12:42:49 sv2828 crmd[7979]:   notice: do_state_transition: State 
transition S_IDLE - S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL 
origin=check_join_state ]
Jul  9 12:42:49 sv2828 corosync[7749]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Jul  9 12:42:49 sv2828 crmd[7979]:  warning: match_down_event: No match for 
shutdown action on sv2827-p1
Jul  9 12:42:49 sv2828 crmd[7979]:   notice: peer_update_callback: 
Stonith/shutdown of sv2827-p1 not matched
Jul  9 12:42:49 sv2828 attrd[7977]:   notice: attrd_local_callback: Sending 
full refresh (origin=crmd)
Jul  9 12:42:49 sv2828 attrd[7977]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-MYSQLFS (1404902183)
Jul  9 12:42:49 sv2828 attrd[7977]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: probe_complete (true)
Jul  9 12:42:49 sv2828 attrd[7977]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-MYSQL (1404901921)
Jul  9 12:42:49 sv2828 pengine[7978]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Jul  9 12:42:49 sv2828 pengine[7978]:  warning: pe_fence_node: Node sv2827-p1 
will be fenced because the node is no longer part of the cluster
Jul  9 12:42:49 sv2828 pengine[7978]:  warning: determine_online_status: Node 
sv2827-p1 is unclean
Jul  9 12:42:49 sv2828 pengine[7978]:  warning: custom_action: Action 
ipmi-fencing-sv2828_stop_0 on sv2827-p1 is unrunnable (offline)
Jul  9 12:42:49 sv2828 pengine[7978]:  warning: stage6: Scheduling Node 
sv2827-p1 for STONITH
Jul  9 12:42:49 sv2828 pengine[7978]:   notice: LogActions: Move
ipmi-fencing-sv2828#011(Started sv2827-p1 - sv2828-p1)
Jul  9 12:42:49 sv2828 pengine[7978]:  warning: process_pe_message: Calculated 
Transition 38: /var/lib/pacemaker/pengine/pe-warn-28.bz2
Jul  9 12:42:49 sv2828 crmd[7979]:   notice: te_fence_node: Executing reboot 
fencing operation (20) on sv2827-p1 (timeout=6)
Jul  9 12:42:49 sv2828 stonith-ng[7975]:   notice: handle_request: Client 
crmd.7979.6c35e3f1 wants to fence (reboot) 'sv2827-p1' with device '(any)'
Jul  9 12:42:49 sv2828 stonith-ng[7975]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for sv2827-p1: 
076062f0-eff3-4798-a504-16c5c5666a5b (0)
Jul  9 12:42:49 sv2828 stonith-ng[7975]:   notice: can_fence_host_with_device: 
ipmi-fencing-sv2827 can not fence sv2827-p1: static-list
Jul  9 12:42:49 sv2828 stonith-ng[7975]:   notice: can_fence_host_with_device: 
ipmi-fencing-sv2828 can not fence sv2827-p1: static-list
Jul  9 12:42:49 sv2828 stonith-ng[7975]:error: remote_op_done: Operation 
reboot of sv2827-p1 by sv2828-p1 for crmd.7979@sv2828-p1.076062f0: No such 
device
Jul  9 12:42:49 sv2828 crmd[7979]:   notice: tengine_stonith_callback: Stonith 
operation 8/20:38:0:71703806-8a7c-447f-a033-e3c26abd607c: No such device (-19)
Jul  9 12:42:49 sv2828 crmd[7979]:   notice: run_graph: Transition 38 
(Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-warn-28.bz2): Stopped

With ipmitool I could verify that IPMI itself works correctly (e.g. a power cycle).
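
Judging from the can_fence_host_with_device lines above, neither device lists 
sv2827-p1 as a host it can fence. For comparison, a sketch of how I would 
expect such a device to be declared (the IPMI address and credentials are 
placeholders):

pcs stonith create ipmi-fencing-sv2827 fence_ipmilan \
    ipaddr=<ipmi-address-of-sv2827> login=<user> passwd=<secret> \
    pcmk_host_check=static-list pcmk_host_list="sv2827-p1" \
    op monitor interval=60s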

pcs status
Cluster name: mysql-int-prod
Last updated: Wed Jul  9 12:46:43 2014
Last change: Wed Jul  9 12:41:14 2014 via crm_resource on sv2828-p1
Stack: cman
Current DC: sv2828-p1 - partition with quorum
Version: 1.1.10-1.el6_4.4-368c726
2 Nodes configured
5 Resources configured
Online: [ sv2827-p1 sv2828-p1 ]
Full list of resources:
ipmi-fencing-sv2827(stonith:fence_ipmilan):

[Pacemaker] CMAN and Pacemaker with IPv6

2014-07-09 Thread Teerapatr Kittiratanachai
Dear All,

I have implemented HA on dual-stack servers.
At first I had not deployed any IPv6 records in DNS, and CMAN and
Pacemaker worked normally. But after I created AAAA records on the DNS
server, I found an error: CMAN cannot start.

Do CMAN and Pacemaker support IPv6?

Regards,
T. Kittiratanachai (Te)

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Creating a safe cluster-node shutdown script (for when UPS goes OnBattery+LowBattery)

2014-07-09 Thread Giuseppe Ragusa
On Tue, Jul 8, 2014, at 02:59, Andrew Beekhof wrote:
 
 On 4 Jul 2014, at 3:16 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com 
 wrote:
 
  Hi all,
  I'm trying to create a script as per subject (on CentOS 6.5, 
  CMAN+Pacemaker, only DRBD+KVM active/passive resources; SNMP-UPS monitored 
  by NUT).
  
  Ideally I think that each node should stop (disable) all locally-running 
  VirtualDomain resources (doing so cleanly demotes than downs the DRBD 
  resources underneath), then put itself in standby and finally shutdown.
 
 Since the end goal is shutdown, why not just run 'pcs cluster stop' ?

I thought that this action would cause communication interruption (since 
Corosync would be not responding to the peer) and so cause the other node to 
stonith us; I know that ideally the other node too should perform pcs cluster 
stop in short, since the same UPS powers both, but I worry about timing issues 
(and races) in UPS monitoring since it is a large Enterprise UPS monitored by 
SNMP.

Furthermore I do not know what happens to running resources at pcs cluster 
stop: I infer from your suggestion that resources are brought down and not 
migrated on the other node, correct?

 Possibly with 'pcs cluster standby' first if you're worried that stopping the 
 resources might take too long.

I thought that pcs cluster standby would usually migrate the resources to the 
other node (I actually tried it and confirmed the expected behaviour); so this 
would risk to become a race with the timing of the other node standby, so this 
is why I took the hassle of explicitly and orderly stopping all locally-running 
resources in my script BEFORE putting the local node in standby.

 Pacemaker will stop everything in the required order and stop the node when 
 done... problem solved?

I thought that after a pcs cluster standby a regular shutdown -h of the 
operating system would cleanly bring down the cluster too, without the need for 
a pcs cluster stop, given that both Pacemaker and CMAN are correctly 
configured for automatic startup/shutdown as operating system services (SysV 
initscripts controlled by CentOS 6.5 Upstart, in my case).

Many thanks again for your always thought-provoking and informative answers!

Regards,
Giuseppe

  
  On further startup, manual intervention would be required to unstandby all 
  nodes and enable resources (nodes already in standby and resources already 
  disabled before blackout should be manually distinguished).
  
  Is this strategy conceptually safe?
  
  Unfortunately, various searches have turned out no prior art :)
  
  This is my tentative script (consider it in the public domain):
  
  
  #!/bin/bash
  
  # Note: pcs cluster status still has a small bug vs. CMAN-controlled 
  Corosync and would always return != 0
pcs status > /dev/null 2>&1
  STATUS=$?
  
  # Detect if cluster is running at all on local node
  # TODO: detect node already in standby and bypass this
  if [ ${STATUS} = 0 ]; then
  local_node=$(cman_tool status | grep -i 'Node[[:space:]]*name:' | sed 
  -e 's/^.*Node\s*name:\s*\([^[:space:]]*\).*$/\1/i')
for local_resource in $(pcs status 2>/dev/null | grep ocf::heartbeat:VirtualDomain.*${local_node}\\s*\$ | awk '{print $1}'); do
  pcs resource disable ${local_resource}
  done
  # TODO: each resource disabling above may return without waiting for 
  complete stop - wait here for no more resources active? (but avoid 
  endless loops)
  pcs cluster standby ${local_node}
  fi
  
  # Shut down gracefully anyway at the end
  /sbin/shutdown -h +0
  
  
  
  Comments/suggestions/improvements are more than welcome.
  
  Many thanks in advance.
  
  Regards,
  Giuseppe
  
-- 
  Giuseppe Ragusa
  giuseppe.rag...@fastmail.fm


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Re: [Pacemaker] Pacemaker 1.1: cloned stonith resources require --force to be added to levels

2014-07-09 Thread Giuseppe Ragusa
On Tue, Jul 8, 2014, at 06:06, Andrew Beekhof wrote:
 
 On 5 Jul 2014, at 1:00 am, Giuseppe Ragusa giuseppe.rag...@hotmail.com 
 wrote:
 
  From: and...@beekhof.net
  Date: Fri, 4 Jul 2014 22:50:28 +1000
  To: pacemaker@oss.clusterlabs.org
  Subject: Re: [Pacemaker] Pacemaker 1.1: cloned stonith resources require
  --force to be added to levels
  
   
  On 4 Jul 2014, at 1:29 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com 
  wrote:
   
Hi all,
while creating a cloned stonith resource

Any particular reason you feel the need to clone it?

   In the end, I suppose it's only a purist mindset :) because it is a PDU 
   whose power outlets control both nodes, so
   its resource should be active (and monitored) on both nodes 
   independently.
   I understand that it would work anyway, leaving it not cloned and not 
   location-constrained
   just as regular, dedicated stonith devices would not need to be 
   location-constrained, right?
   
for multi-level STONITH on a fully-up-to-date CentOS 6.5 
(pacemaker-1.1.10-14.el6_5.3.x86_64):

pcs cluster cib stonith_cfg
pcs -f stonith_cfg stonith create pdu1 fence_apc action=off \
  ipaddr=pdu1.verolengo.privatelan login=cluster passwd=test \
  pcmk_host_map=cluster1.verolengo.privatelan:3,cluster1.verolengo.privatelan:4,cluster2.verolengo.privatelan:6,cluster2.verolengo.privatelan:7 \
  pcmk_host_check=static-list \
  pcmk_host_list=cluster1.verolengo.privatelan,cluster2.verolengo.privatelan \
  op monitor interval=240s
pcs -f stonith_cfg resource clone pdu1 pdu1Clone
pcs -f stonith_cfg stonith level add 2 cluster1.verolengo.privatelan pdu1Clone
pcs -f stonith_cfg stonith level add 2 cluster2.verolengo.privatelan pdu1Clone


the last 2 lines do not succeed unless I add the option --force and 
even so I still get errors when issuing verify:

[root@cluster1 ~]# pcs stonith level verify
Error: pdu1Clone is not a stonith id

If you check, I think you'll find there is no such resource as 
'pdu1Clone'.
I don't believe pcs lets you decide what the clone name is.
   
   You're right! (obviously ; )
   It's been automatically named pdu1-clone
   
   I suppose that there's still too much crmsh in my memory :)
   
   Anyway, removing the stonith level (to start from scratch) and using the 
   correct clone name does not change the result:
   
   [root@cluster1 etc]# pcs -f stonith_cfg stonith level add 2 
   cluster1.verolengo.privatelan pdu1-clone
   Error: pdu1-clone is not a stonith id (use --force to override)
   
  I bet we didn't think of that.
  What if you just do:
   
 pcs -f stonith_cfg stonith level add 2 cluster1.verolengo.privatelan pdu1
   
  Does that work?
   
  
  
  Yes, no errors at all and verify successful.

At first I took this as a simple sanity check, but on second read I think you 
were suggesting that I could clone as usual and then configure the level with 
the primitive resource (which I usually avoid when working with regular 
clones), and it would automatically use the clone at runtime, correct?

  Remember that a full real test (to verify actual second level functionality 
  in presence of first level failure)
  is still pending for both the plain and cloned setup.
  
  Apropos: I read through the list archives that stonith resources (being 
  resources, after all)
  could themselves cause fencing (!) if failing (start, monitor, stop)
 
 stop just unsets a flag in stonithd.
 start does perform a monitor op though, which could fail.
 
 but by default only stop failure would result in fencing.

I thought that start-failure-is-fatal was true by default, but maybe not for 
stonith resources.

  and that an ad-hoc
  on-fail setting could be used to prevent that.
  Maybe my aforementioned naive testing procedure (pull the iLO cable) could 
  provoke that?
 
 _shouldnt_ do so
 
  Would you suggest to configure such an on-fail option?
 
 again, shouldn't be necessary

Thanks again.

Regards,
Giuseppe

  Many thanks again for your help (and all your valuable work, of course!).
  
  Regards,
  Giuseppe
-- 
  Giuseppe Ragusa
  

Re: [Pacemaker] Pacemaker 1.1: cloned stonith resources require --force to be added to levels

2014-07-09 Thread Andrew Beekhof

On 9 Jul 2014, at 10:43 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com wrote:

 On Tue, Jul 8, 2014, at 06:06, Andrew Beekhof wrote:
 
 On 5 Jul 2014, at 1:00 am, Giuseppe Ragusa giuseppe.rag...@hotmail.com 
 wrote:
 
 From: and...@beekhof.net
 Date: Fri, 4 Jul 2014 22:50:28 +1000
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Pacemaker 1.1: cloned stonith resources require
 --force to be added to levels
 
 
 On 4 Jul 2014, at 1:29 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com 
 wrote:
 
 Hi all,
 while creating a cloned stonith resource
 
 Any particular reason you feel the need to clone it?
 
 In the end, I suppose it's only a purist mindset :) because it is a PDU 
 whose power outlets control both nodes, so
 its resource should be active (and monitored) on both nodes 
 independently.
 I understand that it would work anyway, leaving it not cloned and not 
 location-constrained
 just as regular, dedicated stonith devices would not need to be 
 location-constrained, right?
 
 for multi-level STONITH on a fully-up-to-date CentOS 6.5 
 (pacemaker-1.1.10-14.el6_5.3.x86_64):
 
 pcs cluster cib stonith_cfg
 pcs -f stonith_cfg stonith create pdu1 fence_apc action=off \
ipaddr=pdu1.verolengo.privatelan login=cluster passwd=test \
 pcmk_host_map=cluster1.verolengo.privatelan:3,cluster1.verolengo.privatelan:4,cluster2.verolengo.privatelan:6,cluster2.verolengo.privatelan:7
  \
pcmk_host_check=static-list 
 pcmk_host_list=cluster1.verolengo.privatelan,cluster2.verolengo.privatelan
  op monitor interval=240s
 pcs -f stonith_cfg resource clone pdu1 pdu1Clone
 pcs -f stonith_cfg stonith level add 2 cluster1.verolengo.privatelan 
 pdu1Clone
 pcs -f stonith_cfg stonith level add 2 cluster2.verolengo.privatelan 
 pdu1Clone
 
 
 the last 2 lines do not succeed unless I add the option --force and 
 even so I still get errors when issuing verify:
 
 [root@cluster1 ~]# pcs stonith level verify
 Error: pdu1Clone is not a stonith id
 
 If you check, I think you'll find there is no such resource as 
 'pdu1Clone'.
 I don't believe pcs lets you decide what the clone name is.
 
 You're right! (obviously ; )
 It's been automatically named pdu1-clone
 
 I suppose that there's still too much crmsh in my memory :)
 
 Anyway, removing the stonith level (to start from scratch) and using the 
 correct clone name does not change the result:
 
 [root@cluster1 etc]# pcs -f stonith_cfg stonith level add 2 
 cluster1.verolengo.privatelan pdu1-clone
 Error: pdu1-clone is not a stonith id (use --force to override)
 
 I bet we didn't think of that.
 What if you just do:
 
   pcs -f stonith_cfg stonith level add 2 cluster1.verolengo.privatelan pdu1
 
 Does that work?
 
 
 
 Yes, no errors at all and verify successful.
 
 This initially passed by as a simple check for general sanity, while now, on 
 second read, I think you were suggesting that I could clone as usual then 
 configure with the primitive resource (which I usually avoid when working 
 with regular clones) and it should automatically use instead the clone at 
 runtime, correct?

right. but also consider not cloning it at all :)

 
 Remember that a full real test (to verify actual second level functionality 
 in presence of first level failure)
 is still pending for both the plain and cloned setup.
 
 Apropos: I read through the list archives that stonith resources (being 
 resources, after all)
 could themselves cause fencing (!) if failing (start, monitor, stop)
 
 stop just unsets a flag in stonithd.
 start does perform a monitor op though, which could fail.
 
 but by default only stop failure would result in fencing.
 
 I though that start-failure-is-fatal was true by default, but maybe not for 
 stonith resources.

fatal in the sense of won't attempt to run it there again, not the fence the 
whole node way

 
 and that an ad-hoc
 on-fail setting could be used to prevent that.
 Maybe my aforementioned naive testing procedure (pull the iLO cable) could 
 provoke that?
 
 _shouldnt_ do so
 
 Would you suggest to configure such an on-fail option?
 
 again, shouldn't be necessary
 
 Thanks again.
 
 Regards,
 Giuseppe
 
 Many thanks again for your help (and all your valuable work, of course!).
 
 Regards,
 Giuseppe

Re: [Pacemaker] Creating a safe cluster-node shutdown script (for when UPS goes OnBattery+LowBattery)

2014-07-09 Thread Andrew Beekhof

On 9 Jul 2014, at 10:28 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com wrote:

 On Tue, Jul 8, 2014, at 02:59, Andrew Beekhof wrote:
 
 On 4 Jul 2014, at 3:16 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com 
 wrote:
 
 Hi all,
 I'm trying to create a script as per subject (on CentOS 6.5, 
 CMAN+Pacemaker, only DRBD+KVM active/passive resources; SNMP-UPS monitored 
 by NUT).
 
 Ideally I think that each node should stop (disable) all locally-running 
 VirtualDomain resources (doing so cleanly demotes than downs the DRBD 
 resources underneath), then put itself in standby and finally shutdown.
 
 Since the end goal is shutdown, why not just run 'pcs cluster stop' ?
 
 I thought that this action would cause communication interruption (since 
 Corosync would be not responding to the peer) and so cause the other node to 
 stonith us;

No. Shutdown is a globally co-ordinated process.
We don't fence nodes we know shut down cleanly.

 I know that ideally the other node too should perform pcs cluster stop in 
 short, since the same UPS powers both, but I worry about timing issues (and 
 races) in UPS monitoring since it is a large Enterprise UPS monitored by 
 SNMP.
 
 Furthermore I do not know what happens to running resources at pcs cluster 
 stop: I infer from your suggestion that resources are brought down and not 
 migrated on the other node, correct?

If the other node is shutting down too, they'll simply be stopped.
Otherwise we'll try to move them.

 
 Possibly with 'pcs cluster standby' first if you're worried that stopping 
 the resources might take too long.
 
 I thought that pcs cluster standby would usually migrate the resources to 
 the other node (I actually tried it and confirmed the expected behaviour); so 
 this would risk to become a race with the timing of the other node standby,

Not really, at the point the second node runs 'standby' we'll stop trying to 
migrate services and just stop them everywhere.
Again, this is a centrally controlled process, timing isn't a problem.
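
(So presumably the script could shrink to something like this, an untested 
sketch that assumes pcs handles all the ordering, as described above:)

local_node=$(crm_node -n)
pcs cluster standby ${local_node}   # optional, if stopping might take a while
pcs cluster stop                    # stops resources in order, leaves the cluster cleanly
/sbin/shutdown -h +0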

 so this is why I took the hassle of explicitly and orderly stopping all 
 locally-running resources in my script BEFORE putting the local node in 
 standby.
 
 Pacemaker will stop everything in the required order and stop the node when 
 done... problem solved?
 
 I thought that after a pcs cluster standby a regular shutdown -h of the 
 operating system would cleanly bring down the cluster too,

It should do

 without the need for a pcs cluster stop, given that both Pacemaker and CMAN 
 are correctly configured for automatic startup/shutdown as operating system 
 services (SysV initscripts controlled by CentOS 6.5 Upstart, in my case).
 
 Many thanks again for your always thought-provoking and informative answers!
 
 Regards,
 Giuseppe
 
 
 On further startup, manual intervention would be required to unstandby all 
 nodes and enable resources (nodes already in standby and resources already 
 disabled before blackout should be manually distinguished).
 
 Is this strategy conceptually safe?
 
 Unfortunately, various searches have turned out no prior art :)
 
 This is my tentative script (consider it in the public domain):
 
 
 #!/bin/bash
 
 # Note: pcs cluster status still has a small bug vs. CMAN-controlled 
 Corosync and would always return != 0
 pcs status > /dev/null 2>&1
 STATUS=$?
 
 # Detect if cluster is running at all on local node
 # TODO: detect node already in standby and bypass this
 if [ ${STATUS} = 0 ]; then
local_node=$(cman_tool status | grep -i 'Node[[:space:]]*name:' | sed 
 -e 's/^.*Node\s*name:\s*\([^[:space:]]*\).*$/\1/i')
 for local_resource in $(pcs status 2>/dev/null | grep ocf::heartbeat:VirtualDomain.*${local_node}\\s*\$ | awk '{print $1}'); do
pcs resource disable ${local_resource}
done
# TODO: each resource disabling above may return without waiting for 
 complete stop - wait here for no more resources active? (but avoid 
 endless loops)
pcs cluster standby ${local_node}
 fi
 
 # Shut down gracefully anyway at the end
 /sbin/shutdown -h +0
 
 
 
 Comments/suggestions/improvements are more than welcome.
 
 Many thanks in advance.
 
 Regards,
 Giuseppe
 

Re: [Pacemaker] CMAN and Pacemaker with IPv6

2014-07-09 Thread Andrew Beekhof

On 9 Jul 2014, at 9:15 pm, Teerapatr Kittiratanachai maillist...@gmail.com 
wrote:

 Dear All,
 
 I has implemented the HA on dual stack servers,
 Firstly, I doesn't deploy IPv6 record on DNS yet. The CMAN and
 PACEMAKER can work as normal.
 But, after I create  record on DNS server, i found the error that
 cann't start CMAN.
 
 Are CMAN and PACEMAKER  support the IPv6?

I don't think pacemaker cares.
What errors did you get?


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Cannot create more than 27 multistate resources

2014-07-09 Thread Andrew Beekhof

On 9 Jul 2014, at 6:49 pm, K Mehta kiranmehta1...@gmail.com wrote:

 Hi,
 
 [root@vsanqa11 ~]# rpm -qa | grep pcs ; rpm -qa | grep pace ; rpm -qa | grep 
 libqb; rpm -qa | grep coro; rpm -qa | grep cman
 pcs-0.9.90-2.el6.centos.2.noarch
 pacemaker-cli-1.1.10-14.el6_5.3.x86_64
 pacemaker-libs-1.1.10-14.el6_5.3.x86_64
 pacemaker-1.1.10-14.el6_5.3.x86_64
 pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
 libqb-devel-0.16.0-2.el6.x86_64
 libqb-0.16.0-2.el6.x86_64
 corosynclib-1.4.1-17.el6_5.1.x86_64
 corosync-1.4.1-17.el6_5.1.x86_64
 cman-3.0.12.1-59.el6_5.2.x86_64
 [root@vsanqa11 ~]# uname -a
 Linux vsanqa11 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 
 x86_64 x86_64 x86_64 GNU/Linux
 [root@vsanqa11 ~]# cat /etc/redhat-release
 CentOS release 6.3 (Final)
 
 
 Created 27 resources 
 
 [root@vsanqa11 ~]# pcs status
 Cluster name: vsanqa11_12
 Last updated: Wed Jul  9 01:30:05 2014
 Last change: Wed Jul  9 01:24:12 2014 via cibadmin on vsanqa11
 Stack: cman
 Current DC: vsanqa11 - partition with quorum
 Version: 1.1.10-14.el6_5.3-368c726
 2 Nodes configured
 54 Resources configured
 
 
 Online: [ vsanqa11 vsanqa12 ]
 
 Full list of resources:
 
  Master/Slave Set: ms-0feba438-0f16-4de5-bf18-9d576cf4dd26 
 [vha-0feba438-0f16-4de5-bf18-9d576cf4dd26]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-329ad1bd-2f9c-483d-a052-270731aefd70 
 [vha-329ad1bd-2f9c-483d-a052-270731aefd70]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-b5b9e8dc-87c9-4229-b425-870d1bc1f107 
 [vha-b5b9e8dc-87c9-4229-b425-870d1bc1f107]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-56112cc8-4c56-4454-84ea-ba9b7b82dde7 
 [vha-56112cc8-4c56-4454-84ea-ba9b7b82dde7]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-20fd5ed6-8e72-4a6b-9b1a-73bd96e1f253 
 [vha-20fd5ed6-8e72-4a6b-9b1a-73bd96e1f253]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-77af4d7e-d78c-4799-a24a-7536d225 
 [vha-77af4d7e-d78c-4799-a24a-7536d225]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-913f6e7e-932f-4c3f-9e7f-a70c4153b4c7 
 [vha-913f6e7e-932f-4c3f-9e7f-a70c4153b4c7]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-3d3ae48c-1955-4c64-b951-f2a2621a70b7 
 [vha-3d3ae48c-1955-4c64-b951-f2a2621a70b7]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-a8101da1-636f-483b-90a9-9a18fd5a5793 
 [vha-a8101da1-636f-483b-90a9-9a18fd5a5793]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-290e0fdd-19c6-4b40-8046-0fa2c94a7320 
 [vha-290e0fdd-19c6-4b40-8046-0fa2c94a7320]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-8341368b-6b6a-4bf0-bb0a-38b32ffec2f4 
 [vha-8341368b-6b6a-4bf0-bb0a-38b32ffec2f4]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-6695f479-0d21-4388-9540-a440c32e0944 
 [vha-6695f479-0d21-4388-9540-a440c32e0944]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-85fe517e-9a2b-4ac1-b6e1-e4da57b49969 
 [vha-85fe517e-9a2b-4ac1-b6e1-e4da57b49969]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-5b9adf5c-3bb5-4353-8c5f-a3e1508f7c4a 
 [vha-5b9adf5c-3bb5-4353-8c5f-a3e1508f7c4a]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-57b4b7a1-d3bb-4b94-b8dc-8a84e4a3de03 
 [vha-57b4b7a1-d3bb-4b94-b8dc-8a84e4a3de03]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-b4fbac14-ef19-4861-93e5-14574e101484 
 [vha-b4fbac14-ef19-4861-93e5-14574e101484]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-ec4836d3-b768-4abd-83bf-3a61143346ce 
 [vha-ec4836d3-b768-4abd-83bf-3a61143346ce]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-28d6db1a-bdba-4fc0-b74c-ef74548cf714 
 [vha-28d6db1a-bdba-4fc0-b74c-ef74548cf714]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-c6cfd538-17f4-4c4b-b259-731e2cac75f3 
 [vha-c6cfd538-17f4-4c4b-b259-731e2cac75f3]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-06d19988-7549-4882-b780-fd598714ec7f 
 [vha-06d19988-7549-4882-b780-fd598714ec7f]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-ec105b52-42b1-42f5-931f-249fbc2f16c4 
 [vha-ec105b52-42b1-42f5-931f-249fbc2f16c4]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-6cc9a97b-9714-4cab-bf52-103c46b8593f 
 [vha-6cc9a97b-9714-4cab-bf52-103c46b8593f]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-b80b4a69-67e2-43bd-8eba-f6c18f78d706 
 [vha-b80b4a69-67e2-43bd-8eba-f6c18f78d706]
  Masters: [ vsanqa12 ]
  Slaves: [ vsanqa11 ]
  Master/Slave Set: ms-24817d3c-98e0-4408-9bce-67d8bb2495db 
 [vha-24817d3c-98e0-4408-9bce-67d8bb2495db]
  Masters: [ vsanqa12 ]
  Slaves: 

Re: [Pacemaker] strange error

2014-07-09 Thread Andrew Beekhof
Is NetworkManager present?  Using dhcp for that interface?


On 9 Jul 2014, at 7:03 pm, divinesecret arvy...@artogama.lt wrote:

 Hi,
 
 
 just wanted to ask maybe someone encountered such situation.
 suddenly cluster fails:
 
 Jul  9 04:17:58 sdcsispprxfe1 IPaddr2(extVip51)[17292]: ERROR: Unknown 
 interface [eth1] No such device.
 Jul  9 04:17:58 sdcsispprxfe1 IPaddr2(extVip51)[17292]: ERROR: [findif] failed
 Jul  9 04:17:58 sdcsispprxfe1 crmd[2116]:   notice: process_lrm_event: LRM 
 operation extVip51_monitor_2 (call=57, rc=6, cib-update=2151, 
 confirmed=false) not configured
 Jul  9 04:17:58 sdcsispprxfe1 crmd[2116]:  warning: update_failcount: 
 Updating failcount for extVip51 on sdcsispprxfe1 after failed monitor: rc=6 
 (update=value++, time=1404868678)
 Jul  9 04:17:58 sdcsispprxfe1 crmd[2116]:   notice: do_state_transition: 
 State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC 
 cause=C_FSA_INTERNAL origin=abort_transition_graph ]
 Jul  9 04:17:58 sdcsispprxfe1 attrd[2114]:   notice: attrd_trigger_update: 
 Sending flush op to all hosts for: fail-count-extVip51 (1)
 Jul  9 04:17:58 sdcsispprxfe1 pengine[2115]:   notice: unpack_config: On loss 
 of CCM Quorum: Ignore
 Jul  9 04:17:58 sdcsispprxfe1 attrd[2114]:   notice: attrd_perform_update: 
 Sent update 42: fail-count-extVip51=1
 Jul  9 04:17:58 sdcsispprxfe1 attrd[2114]:   notice: attrd_trigger_update: 
 Sending flush op to all hosts for: last-failure-extVip51 (1404868678)
 Jul  9 04:17:58 sdcsispprxfe1 pengine[2115]:error: unpack_rsc_op: 
 Preventing extVip51 from re-starting anywhere in the cluster : operation 
 monitor failed 'not configured' (rc=6)
 Jul  9 04:17:58 sdcsispprxfe1 pengine[2115]:  warning: unpack_rsc_op: 
 Processing failed op monitor for extVip51 on sdcsispprxfe1: not configured (6)
 
 restart was issued and then:
 
 IPaddr2(extVip51)[23854]: INFO: Bringing device eth1 up
 
 
 
 
 Version: 1.1.10-14.el6_5.3-368c726
 centos 6.5
 
 
 (other logs don't show eth1 going down or sthing similar)
 
 
 
 
 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CMAN and Pacemaker with IPv6

2014-07-09 Thread Teerapatr Kittiratanachai
I have not found any relevant log message.

/var/log/messages
...
Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01) installed
Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker Cluster Manager
...

and this is what display when I try to start pacemaker

# /etc/init.d/pacemaker start
Starting cluster:
   Checking if cluster has been disabled at boot...[  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules...   [  OK  ]
   Mounting configfs...[  OK  ]
   Starting cman... Cannot find node name in cluster.conf
Unable to get the configuration
Cannot find node name in cluster.conf
cman_tool: corosync daemon didn't start Check cluster logs for details
   [FAILED]
Stopping cluster:
   Leaving fence domain... [  OK  ]
   Stopping gfs_controld...[  OK  ]
   Stopping dlm_controld...[  OK  ]
   Stopping fenced...  [  OK  ]
   Stopping cman...[  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs...  [  OK  ]
Aborting startup of Pacemaker Cluster Manager

One more thing: because of this problem, I have removed the AAAA
records from DNS for now and mapped the names in the /etc/hosts file
instead, as shown below.

/etc/hosts
...
2001:db8:0:1::1   node0.example.com
2001:db8:0:1::2   node1.example.com
...

Is there any configuration that would help me get more logs?
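
For reference, these are checks that can be run by hand (the host names follow 
the /etc/hosts example above):

uname -n                                 # should match a clusternode name in cluster.conf
grep 'clusternode name' /etc/cluster/cluster.conf
getent hosts node0.example.com node1.example.com   # shows which address family resolves
ls /var/log/cluster/                     # cman/corosync normally write their logs here on EL6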

On Thu, Jul 10, 2014 at 5:06 AM, Andrew Beekhof and...@beekhof.net wrote:

 On 9 Jul 2014, at 9:15 pm, Teerapatr Kittiratanachai maillist...@gmail.com 
 wrote:

 Dear All,

 I has implemented the HA on dual stack servers,
 Firstly, I doesn't deploy IPv6 record on DNS yet. The CMAN and
 PACEMAKER can work as normal.
 But, after I create  record on DNS server, i found the error that
 cann't start CMAN.

 Are CMAN and PACEMAKER  support the IPv6?

 I don;t think pacemaker cares.
 What errors did you get?



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker 1.1: cloned stonith resources require --force to be added to levels

2014-07-09 Thread Giuseppe Ragusa
On Thu, Jul 10, 2014, at 00:00, Andrew Beekhof wrote:
 
 On 9 Jul 2014, at 10:43 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com 
 wrote:
 
  On Tue, Jul 8, 2014, at 06:06, Andrew Beekhof wrote:
  
  On 5 Jul 2014, at 1:00 am, Giuseppe Ragusa giuseppe.rag...@hotmail.com 
  wrote:
  
  From: and...@beekhof.net
  Date: Fri, 4 Jul 2014 22:50:28 +1000
  To: pacemaker@oss.clusterlabs.org
  Subject: Re: [Pacemaker] Pacemaker 1.1: cloned stonith resources require  
  --force to be added to levels
  
  
  On 4 Jul 2014, at 1:29 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com 
  wrote:
  
  Hi all,
  while creating a cloned stonith resource
  
  Any particular reason you feel the need to clone it?
  
  In the end, I suppose it's only a purist mindset :) because it is a 
  PDU whose power outlets control both nodes, so
  its resource should be active (and monitored) on both nodes 
  independently.
  I understand that it would work anyway, leaving it not cloned and not 
  location-constrained
  just as regular, dedicated stonith devices would not need to be 
  location-constrained, right?
  
  for multi-level STONITH on a fully-up-to-date CentOS 6.5 
  (pacemaker-1.1.10-14.el6_5.3.x86_64):
  
  pcs cluster cib stonith_cfg
  pcs -f stonith_cfg stonith create pdu1 fence_apc action=off \
 ipaddr=pdu1.verolengo.privatelan login=cluster passwd=test \  

  pcmk_host_map=cluster1.verolengo.privatelan:3,cluster1.verolengo.privatelan:4,cluster2.verolengo.privatelan:6,cluster2.verolengo.privatelan:7
   \
 pcmk_host_check=static-list 
  pcmk_host_list=cluster1.verolengo.privatelan,cluster2.verolengo.privatelan
   op monitor interval=240s
  pcs -f stonith_cfg resource clone pdu1 pdu1Clone
  pcs -f stonith_cfg stonith level add 2 cluster1.verolengo.privatelan 
  pdu1Clone
  pcs -f stonith_cfg stonith level add 2 cluster2.verolengo.privatelan 
  pdu1Clone
  
  
  the last 2 lines do not succeed unless I add the option --force and 
  even so I still get errors when issuing verify:
  
  [root@cluster1 ~]# pcs stonith level verify
  Error: pdu1Clone is not a stonith id
  
  If you check, I think you'll find there is no such resource as 
  'pdu1Clone'.
  I don't believe pcs lets you decide what the clone name is.
  
  You're right! (obviously ; )
  It's been automatically named pdu1-clone
  
  I suppose that there's still too much crmsh in my memory :)
  
  Anyway, removing the stonith level (to start from scratch) and using the 
  correct clone name does not change the result:
  
  [root@cluster1 etc]# pcs -f stonith_cfg stonith level add 2 
  cluster1.verolengo.privatelan pdu1-clone
  Error: pdu1-clone is not a stonith id (use --force to override)
  
  I bet we didn't think of that.
  What if you just do:
  
pcs -f stonith_cfg stonith level add 2 cluster1.verolengo.privatelan 
  pdu1
  
  Does that work?
  
  
  
  Yes, no errors at all and verify successful.
  
   At first I took this as a simple sanity check, but on second read I think
   you were suggesting that I could clone as usual and then configure the level
   with the primitive resource (something I normally avoid when working with
   regular clones), and that the clone would automatically be used at runtime,
   correct?
 
 right. but also consider not cloning it at all :)

I understand that in your opinion cloned stonith resources add almost no value.
I suppose, then, that if a PDU-type resource happened to be running on the very
node it must now fence, it would first be moved elsewhere or something similar
(since, as I understand it, stonith resources cannot fence the node they are
running on), right?
If that is the case and there is no adverse effect whatsoever (not even a
significant delay), I will promptly remove the clone and configure my second
levels using the primitive PDU stonith resource. If, on the other hand, you
think there could be some legitimate use for such clones, I could open an RFE
in Bugzilla for clones to be recognized as stonith ids and usable in fencing
levels (if you suggest so).
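
In case it is useful, the uncloned variant I would switch to is roughly the
sketch below (based on the commands above and not yet tested on this cluster;
the cib-push and verify steps at the end are simply how I would check the
result, and on older pcs the push step may be spelled "pcs cluster push cib
stonith_cfg"):

  # second fencing level pointing at the primitive stonith resource, no clone
  pcs cluster cib stonith_cfg
  pcs -f stonith_cfg stonith level add 2 cluster1.verolengo.privatelan pdu1
  pcs -f stonith_cfg stonith level add 2 cluster2.verolengo.privatelan pdu1
  pcs cluster cib-push stonith_cfg
  pcs stonith level verify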

Anyway, many thanks for your advice and insight, obviously :)

  Remember that a full real test (to verify actual second level 
  functionality in presence of first level failure)
  is still pending for both the plain and cloned setup.
  
  Apropos: I read through the list archives that stonith resources (being 
  resources, after all)
  could themselves cause fencing (!) if failing (start, monitor, stop)
  
  stop just unsets a flag in stonithd.
  start does perform a monitor op though, which could fail.
  
  but by default only stop failure would result in fencing.
  
   I thought that start-failure-is-fatal was true by default, but maybe not for
   stonith resources.
 
  fatal in the sense of "won't attempt to run it there again", not in the
  "fence the whole node" way

Ah right, I remember now all the suggestions I found about migration-threshold, 

Re: [Pacemaker] Creating a safe cluster-node shutdown script (for when UPS goes OnBattery+LowBattery)

2014-07-09 Thread Giuseppe Ragusa
On Thu, Jul 10, 2014, at 00:06, Andrew Beekhof wrote:
 
 On 9 Jul 2014, at 10:28 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com 
 wrote:
 
  On Tue, Jul 8, 2014, at 02:59, Andrew Beekhof wrote:
  
  On 4 Jul 2014, at 3:16 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com 
  wrote:
  
  Hi all,
  I'm trying to create a script as per subject (on CentOS 6.5, 
  CMAN+Pacemaker, only DRBD+KVM active/passive resources; SNMP-UPS 
  monitored by NUT).
  
   Ideally I think that each node should stop (disable) all locally-running
   VirtualDomain resources (doing so cleanly demotes and then downs the DRBD
   resources underneath), then put itself in standby and finally shut down.
  
  Since the end goal is shutdown, why not just run 'pcs cluster stop' ?
  
   I thought that this action would interrupt communication (since
   Corosync would no longer be responding to the peer) and so cause the other
   node to stonith us;
 
 No. Shutdown is a globally co-ordinated process.
 We don't fence nodes we know shut down cleanly.

Thanks for the clarification.
Now that you've said it, it also seems logical and even obvious ;)

   I know that ideally the other node should perform pcs cluster stop shortly
   too, since the same UPS powers both, but I worry about timing issues (and
   races) in UPS monitoring, since it is a large enterprise UPS monitored via
   SNMP.
  
   Furthermore, I do not know what happens to running resources at pcs cluster
   stop: I infer from your suggestion that resources are brought down and not
   migrated to the other node, correct?
 
 If the other node is shutting down too, they'll simply be stopped.
 Otherwise we'll try to move them.

It's the moving that worries me :)

  Possibly with 'pcs cluster standby' first if you're worried that stopping 
  the resources might take too long.

I forgot to ask: in which way would a previous standby make the resources stop 
sooner?

   I thought that pcs cluster standby would usually migrate the resources to
   the other node (I actually tried it and confirmed the expected behaviour);
   so this would risk becoming a race with the timing of the other node's
   standby,
 
 Not really, at the point the second node runs 'standby' we'll stop trying to 
 migrate services and just stop them everywhere.
 Again, this is a centrally controlled process, timing isn't a problem.

I understand that timing won't be a problem and that resources will eventually
stop, but from your description I'm afraid that some delay could creep into the
total shutdown process, arising from possibly unsynchronized UPS notifications
on the nodes (the first node starts standby, resources start to move, THEN the
second node starts standby).

So I'm taking your advice and will modify the script to use pcs cluster stop;
but, with the aim of avoiding the aforementioned delay (if it actually is a
possibility), I would like to ask you three questions (a draft of the revised
script is sketched after the questions):

*) if I simply issue a pcs cluster stop --all from the first node that gets 
notified of UPS critical status, do I risk any adverse effect when the other 
node asynchronously gives the same command some time later (before/after the 
whole cluster stop sequence completes)?

*) does the aforementioned pcs cluster stop --all command return only after 
the cluster stop sequence has actually/completely ended (so as to safely issue 
a shutdown -h now immediately afterwards)?

*) is the pcs cluster stop --all command known to work reliably on current 
CentOS 6.5? (I ask since I found some discussion around pcs cluster start 
related bugs)
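
For context, the revised script I have in mind is roughly the draft below (a
sketch only: it assumes that pcs cluster stop --all blocks until the stop
sequence has completed, which is exactly question two above, and the log path
and the way NUT invokes the script are placeholders):

  #!/bin/bash
  # Draft UPS shutdown hook: stop the whole cluster, then power off this node.
  # Meant to be run by NUT on OnBattery+LowBattery on whichever node gets
  # notified first; the other node running it again later should be harmless.
  LOG=/var/log/ups-cluster-shutdown.log
  {
    echo "$(date): UPS critical, stopping Pacemaker/CMAN on all nodes"
    # stop resources and cluster services cluster-wide (assumed to return
    # only once the stop sequence has finished)
    pcs cluster stop --all
    echo "$(date): cluster stopped, powering off"
  } >> "$LOG" 2>&1
  shutdown -h now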

Many thanks again for your invaluable help and insight.

Regards,
Giuseppe

   so this is why I went to the trouble of explicitly stopping, in order, all
   locally-running resources in my script BEFORE putting the local node in
   standby.
  
  Pacemaker will stop everything in the required order and stop the node 
  when done... problem solved?
  
  I thought that after a pcs cluster standby a regular shutdown -h of the 
  operating system would cleanly bring down the cluster too,
 
 It should do
 
  without the need for a pcs cluster stop, given that both Pacemaker and 
  CMAN are correctly configured for automatic startup/shutdown as operating 
  system services (SysV initscripts controlled by CentOS 6.5 Upstart, in my 
  case).
  
  Many thanks again for your always thought-provoking and informative answers!
  
  Regards,
  Giuseppe
  
  
   On the next startup, manual intervention would be required to unstandby
   all nodes and re-enable resources (nodes already in standby and resources
   already disabled before the blackout should be distinguished manually).
  
  Is this strategy conceptually safe?
  
   Unfortunately, various searches have turned up no prior art :)
  
  This is my tentative script (consider it in the public domain):
  
  
  #!/bin/bash
  
  # Note: pcs cluster status still 

Re: [Pacemaker] CMAN and Pacemaker with IPv6

2014-07-09 Thread Teerapatr Kittiratanachai
OK, some problems are solved.
I was using an incorrect hostname.

Now a new problem has occurred.

  Starting cman... Node address family does not match multicast address family
Unable to get the configuration
Node address family does not match multicast address family
cman_tool: corosync daemon didn't start Check cluster logs for details
   [FAILED]

How can I fix this? Should I just assign the multicast address explicitly in the configuration?
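
For example, would something along these lines in cluster.conf be the right
direction? (Just a sketch on my side: the cluster name, config_version and the
ff15::1 multicast address are guesses; only the node names are real.)

  <?xml version="1.0"?>
  <cluster name="ha_cluster" config_version="2">
    <clusternodes>
      <clusternode name="node0.example.com" nodeid="1"/>
      <clusternode name="node1.example.com" nodeid="2"/>
    </clusternodes>
    <cman>
      <!-- the nodes resolve to IPv6 addresses, so the multicast address
           has to be IPv6 as well; cman refuses to mix address families -->
      <multicast addr="ff15::1"/>
    </cman>
  </cluster>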

Regards,
Te

On Thu, Jul 10, 2014 at 7:52 AM, Teerapatr Kittiratanachai
maillist...@gmail.com wrote:
 I could not find any log message.

 /var/log/messages
 ...
 Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01) installed
 Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker Cluster Manager
 ...

 and this is what is displayed when I try to start pacemaker

 # /etc/init.d/pacemaker start
 Starting cluster:
Checking if cluster has been disabled at boot...[  OK  ]
Checking Network Manager... [  OK  ]
Global setup... [  OK  ]
Loading kernel modules...   [  OK  ]
Mounting configfs...[  OK  ]
Starting cman... Cannot find node name in cluster.conf
 Unable to get the configuration
 Cannot find node name in cluster.conf
 cman_tool: corosync daemon didn't start Check cluster logs for details
[FAILED]
 Stopping cluster:
Leaving fence domain... [  OK  ]
Stopping gfs_controld...[  OK  ]
Stopping dlm_controld...[  OK  ]
Stopping fenced...  [  OK  ]
Stopping cman...[  OK  ]
Unloading kernel modules... [  OK  ]
Unmounting configfs...  [  OK  ]
 Aborting startup of Pacemaker Cluster Manager

 One more thing: because of this problem, I have removed the
 AAAA record from DNS for now and mapped the names in the /etc/hosts file
 instead, as shown below.

 /etc/hosts
 ...
 2001:db8:0:1::1   node0.example.com
 2001:db8:0:1::2   node1.example.com
 ...

 Is there any configuration that would help me get more detailed logs?

 On Thu, Jul 10, 2014 at 5:06 AM, Andrew Beekhof and...@beekhof.net wrote:

 On 9 Jul 2014, at 9:15 pm, Teerapatr Kittiratanachai maillist...@gmail.com 
 wrote:

 Dear All,

 I have implemented HA on dual-stack servers.
 At first I had not yet deployed the IPv6 (AAAA) record in DNS, and CMAN and
 Pacemaker worked normally.
 But after I created the AAAA record on the DNS server, I found an error:
 CMAN cannot start.

 Do CMAN and Pacemaker support IPv6?

 I don't think pacemaker cares.
 What errors did you get?
