Re: [Pacemaker] Pacemaker with Xen 4.3 problem
Hello,

do you mean the Xen script in /usr/lib/ocf/resource.d/heartbeat/ ? I also tried replacing all xm commands with xl there, with no success. Could you show me your Xen RA resource configuration?

Best regards
T. Reineck

Date: Tue, 8 Jul 2014 22:27:59 +0200
From: alxg...@gmail.com
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Pacemaker with Xen 4.3 problem

IIRC the Xen RA uses 'xm'. However, fixing the RA is trivial and worked for me (if you're using the same RA).

On 2014-07-08 21:39, Tobias Reineck tobias.rein...@hotmail.de wrote:

> Hello,
>
> I am trying to build a Xen HA cluster with pacemaker/corosync. Xen 4.3 works on all nodes and Xen live migration works fine. Pacemaker also works with the cluster virtual IP. But when I try to bring a Xen OCF heartbeat resource online, an error appears:
>
> ##
> Failed actions:
>     xen_dns_ha_start_0 on xen01.domain.dom 'unknown error' (1): call=31, status=complete, last-rc-change='Sun Jul 6 15:02:25 2014', queued=0ms, exec=555ms
>     xen_dns_ha_start_0 on xen02.domain.dom 'unknown error' (1): call=10, status=complete, last-rc-change='Sun Jul 6 15:15:09 2014', queued=0ms, exec=706ms
> ##
>
> I added the resource with the command
>
> crm configure primitive xen_dns_ha ocf:heartbeat:Xen \
>   params xmfile=/root/xen_storage/dns_dhcp/dns_dhcp.xen \
>   op monitor interval=10s \
>   op start interval=0s timeout=30s \
>   op stop interval=0s timeout=300s
>
> In /var/log/messages the following error is printed:
>
> 2014-07-08T21:09:19.885239+02:00 xen01 lrmd[3443]: notice: operation_finished: xen_dns_ha_stop_0:18214:stderr [ Error: Unable to connect to xend: No such file or directory. Is xend running? ]
>
> I use Xen 4.3 with the xl toolstack, without xend. Is it possible to use Pacemaker with Xen 4.3? Can anybody please help me?
>
> Best regards
> T. Reineck
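Since xend is not running on an xl-only Xen 4.3 host, it is worth confirming that the domain itself starts cleanly with xl before handing it to Pacemaker. A quick manual check along these lines (the config path is the one from the thread; the domain name "dns_dhcp" is an assumption taken from the file name and may differ in the actual .xen config):

  xl create /root/xen_storage/dns_dhcp/dns_dhcp.xen
  xl list                     # the domain should appear here
  xl shutdown dns_dhcp        # adjust to the name= value in the config

If xl itself fails at this point, the resource agent cannot succeed either.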
Re: [Pacemaker] Pacemaker with Xen 4.3 problem
Actually I did it for the stonith resource agent external:xen0. xm and xl are supposed to be semantically very close, and as far as I can see the ocf:heartbeat:Xen agent doesn't use any xm command that shouldn't also work with xl. What error do you get when using xl instead of xm?

Regards.

2014-07-09 8:39 GMT+02:00 Tobias Reineck tobias.rein...@hotmail.de:
> Hello, do you mean the Xen script in /usr/lib/ocf/resource.d/heartbeat/ ? I also tried replacing all xm with xl, with no success. Could you show me your Xen RA resource configuration?
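For reference, the blanket substitution described above can be tried on a copy of the agent; this is only a rough sketch (back up the original first, and note that the updated RA mentioned later in the thread, which supports both xl and xm, is the cleaner route):

  cp /usr/lib/ocf/resource.d/heartbeat/Xen /usr/lib/ocf/resource.d/heartbeat/Xen.xm-orig
  sed -i 's/\bxm /xl /g' /usr/lib/ocf/resource.d/heartbeat/Xen
  grep -n '\bxm ' /usr/lib/ocf/resource.d/heartbeat/Xen   # confirm nothing was missed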
Re: [Pacemaker] Pacemaker with Xen 4.3 problem
On Wed, 9 Jul 2014 08:39:09 +0200 Tobias Reineck tobias.rein...@hotmail.de wrote:
> Hello, do you mean the Xen script in /usr/lib/ocf/resource.d/heartbeat/ ? I also tried replacing all xm with xl, with no success. Could you show me your Xen RA resource configuration?

I have been working on an updated Xen RA which supports both xl and xm. The pull request is here:

https://github.com/ClusterLabs/resource-agents/pull/440

--
// Kristoffer Grönlund
// kgronl...@suse.com
[Pacemaker] Help with config please
Hi,

configuring pacemaker on CentOS 6.5:

pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64

This is my config:

Cluster Name: ybrp
Corosync Nodes:
Pacemaker Nodes: devrp1 devrp2

Resources:
 Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.172.214.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
  Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s
  Operations: monitor on-fail=restart interval=5s timeout=20s (ybrpip-monitor-interval-5s)
 Clone: ybrpstat-clone
  Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1
  Resource: ybrpstat (class=ocf provider=yb type=proxy)
   Operations: monitor on-fail=restart interval=5s timeout=20s (ybrpstat-monitor-interval-5s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start ybrpstat-clone then start ybrpip (Mandatory) (id:order-ybrpstat-clone-ybrpip-mandatory)
Colocation Constraints:
  ybrpip with ybrpstat-clone (INFINITY) (id:colocation-ybrpip-ybrpstat-clone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6_5.3-368c726
 last-lrm-refresh: 1404892739
 no-quorum-policy: ignore
 stonith-enabled: false

I have my own resource agent file, and I start/stop the proxy service outside of pacemaker.

I had an interesting problem: I did a VMware update on the Linux box, which interrupted network activity. The monitor function in my script 1) tests whether the proxy process is running, and 2) fetches a status page from the proxy and confirms it returns 200.

This is what I got in /var/log/messages:

Jul 9 06:16:13 devrp1 crmd[6849]: warning: update_failcount: Updating failcount for ybrpstat on devrp2 after failed monitor: rc=7 (update=value++, time=1404850573)
Jul 9 06:16:13 devrp1 crmd[6849]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul 9 06:16:13 devrp1 pengine[6848]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jul 9 06:16:13 devrp1 pengine[6848]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on devrp2: not running (7)
Jul 9 06:16:13 devrp1 pengine[6848]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:1 on devrp1: unknown error (1)
Jul 9 06:16:13 devrp1 pengine[6848]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul 9 06:16:13 devrp1 pengine[6848]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul 9 06:16:13 devrp1 pengine[6848]: notice: LogActions: Restart ybrpip#011(Started devrp2)
Jul 9 06:16:13 devrp1 pengine[6848]: notice: LogActions: Recover ybrpstat:0#011(Started devrp2)
Jul 9 06:16:13 devrp1 pengine[6848]: notice: process_pe_message: Calculated Transition 1054: /var/lib/pacemaker/pengine/pe-input-235.bz2
Jul 9 06:16:13 devrp1 pengine[6848]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jul 9 06:16:13 devrp1 pengine[6848]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on devrp2: not running (7)
Jul 9 06:16:13 devrp1 pengine[6848]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:1 on devrp1: unknown error (1)
Jul 9 06:16:13 devrp1 pengine[6848]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul 9 06:16:13 devrp1 pengine[6848]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul 9 06:16:13 devrp1 pengine[6848]: notice: LogActions: Restart ybrpip#011(Started devrp2)
Jul 9 06:16:13 devrp1 pengine[6848]: notice: LogActions: Recover ybrpstat:0#011(Started devrp2)
Jul 9 06:16:13 devrp1 pengine[6848]: notice: process_pe_message: Calculated Transition 1055: /var/lib/pacemaker/pengine/pe-input-236.bz2
Jul 9 06:16:13 devrp1 pengine[6848]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jul 9 06:16:13 devrp1 pengine[6848]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on devrp2: not running (7)
Jul 9 06:16:13 devrp1 pengine[6848]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:1 on devrp1: unknown error (1)
Jul 9 06:16:13 devrp1 pengine[6848]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul 9 06:16:13 devrp1 pengine[6848]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from devrp1 after 100 failures (max=100)
Jul 9 06:16:13 devrp1 pengine[6848]: notice: LogActions: Restart ybrpip#011(Started devrp2)
Jul 9 06:16:13 devrp1 pengine[6848]: notice: LogActions: Recover ybrpstat:0#011(Started devrp2)

And it stayed this way for the next 12 hours, until I got on. I
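For readers searching the archive: the monitor behaviour described above (process check plus HTTP 200 check) typically looks something like the sketch below in a custom OCF agent. The function name, process name, and status URL are illustrative, not the poster's actual agent, and the $OCF_* return codes assume the agent sources ocf-shellfuncs:

  proxy_monitor() {
      # 1) is the proxy process running at all?
      pidof yb-proxy >/dev/null 2>&1 || return $OCF_NOT_RUNNING
      # 2) does the status page answer with HTTP 200?
      code=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/status)
      [ "$code" = "200" ] && return $OCF_SUCCESS
      return $OCF_ERR_GENERIC
  }

Note also that once a resource has hit migration-threshold on a node it stays banned there until the failcount expires or is cleared, e.g. with "pcs resource cleanup ybrpstat-clone".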
Re: [Pacemaker] Pacemaker with Xen 4.3 problem
Hello,

here is the log output:

#
2014-07-09T10:49:01.315764+02:00 xen01 crmd[31294]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
2014-07-09T10:49:01.479820+02:00 xen01 crm_verify[31299]: notice: crm_log_args: Invoked: crm_verify -V -p
2014-07-09T10:49:17.135725+02:00 xen01 crmd[31359]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
2014-07-09T10:49:32.683094+02:00 xen01 crmd[31367]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
2014-07-09T10:52:33.063416+02:00 xen01 crmd[31668]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
2014-07-09T10:52:33.224051+02:00 xen01 crm_verify[31673]: notice: crm_log_args: Invoked: crm_verify -V -p
2014-07-09T10:52:33.378325+02:00 xen01 pengine[31686]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
2014-07-09T10:52:33.466427+02:00 xen01 crmd[3446]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
2014-07-09T10:52:33.480118+02:00 xen01 pengine[3445]: notice: unpack_config: On loss of CCM Quorum: Ignore
2014-07-09T10:52:33.480151+02:00 xen01 pengine[3445]: notice: LogActions: Start dnsdhcp#011(xen02.domain.dom)
2014-07-09T10:52:33.480161+02:00 xen01 pengine[3445]: notice: process_pe_message: Calculated Transition 227: /var/lib/pacemaker/pengine/pe-input-240.bz2
2014-07-09T10:52:33.480431+02:00 xen01 crmd[3446]: notice: te_rsc_command: Initiating action 7: monitor dnsdhcp_monitor_0 on xen02.domain.dom
2014-07-09T10:52:33.481059+02:00 xen01 crmd[3446]: notice: te_rsc_command: Initiating action 5: monitor dnsdhcp_monitor_0 on xen01.domain.dom (local)
2014-07-09T10:52:33.586987+02:00 xen01 crmd[3446]: notice: process_lrm_event: Operation dnsdhcp_monitor_0: not running (node=xen01.domain.dom, call=102, rc=7, cib-update=380, confirmed=true)
2014-07-09T10:52:33.611876+02:00 xen01 crmd[3446]: notice: te_rsc_command: Initiating action 4: probe_complete probe_complete-xen01.domain.dom on xen01.domain.dom (local) - no waiting
2014-07-09T10:52:33.810913+02:00 xen01 crmd[3446]: notice: te_rsc_command: Initiating action 6: probe_complete probe_complete-xen02.domain.dom on xen02.domain.dom - no waiting
2014-07-09T10:52:33.813788+02:00 xen01 crmd[3446]: notice: te_rsc_command: Initiating action 10: start dnsdhcp_start_0 on xen02.domain.dom
2014-07-09T10:52:33.975340+02:00 xen01 crmd[3446]: warning: status_from_rc: Action 10 (dnsdhcp_start_0) on xen02.domain.dom failed (target: 0 vs. rc: 1): Error
2014-07-09T10:52:33.975412+02:00 xen01 crmd[3446]: warning: update_failcount: Updating failcount for dnsdhcp on xen02.domain.dom after failed start: rc=1 (update=INFINITY, time=1404895953)
2014-07-09T10:52:33.979271+02:00 xen01 crmd[3446]: notice: abort_transition_graph: Transition aborted by dnsdhcp_start_0 'modify' on xen02.domain.dom: Event failed (magic=0:1;10:227:0:37f37c0c-b063-4225-a380-a41137f7d460, cib=0.94.3, source=match_graph_event:344, 0)
2014-07-09T10:52:33.984242+02:00 xen01 crmd[3446]: warning: update_failcount: Updating failcount for dnsdhcp on xen02.domain.dom after failed start: rc=1 (update=INFINITY, time=1404895953)
2014-07-09T10:52:33.985790+02:00 xen01 crmd[3446]: warning: status_from_rc: Action 10 (dnsdhcp_start_0) on xen02.domain.dom failed (target: 0 vs. rc: 1): Error
2014-07-09T10:52:33.987069+02:00 xen01 crmd[3446]: warning: update_failcount: Updating failcount for dnsdhcp on xen02.domain.dom after failed start: rc=1 (update=INFINITY, time=1404895953)
2014-07-09T10:52:33.988034+02:00 xen01 crmd[3446]: warning: update_failcount: Updating failcount for dnsdhcp on xen02.domain.dom after failed start: rc=1 (update=INFINITY, time=1404895953)
2014-07-09T10:52:33.988729+02:00 xen01 crmd[3446]: notice: run_graph: Transition 227 (Complete=6, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-240.bz2): Stopped
2014-07-09T10:52:33.989334+02:00 xen01 pengine[3445]: notice: unpack_config: On loss of CCM Quorum: Ignore
2014-07-09T10:52:33.990014+02:00 xen01 pengine[3445]: warning: unpack_rsc_op_failure: Processing failed op start for dnsdhcp on xen02.domain.dom: unknown error (1)
2014-07-09T10:52:33.990615+02:00 xen01 pengine[3445]: warning: unpack_rsc_op_failure: Processing failed op start for dnsdhcp on xen02.domain.dom: unknown error (1)
2014-07-09T10:52:33.991355+02:00 xen01 pengine[3445]: notice: LogActions: Recover dnsdhcp#011(Started xen02.domain.dom)
2014-07-09T10:52:33.992005+02:00 xen01 pengine[3445]: notice: process_pe_message: Calculated Transition 228: /var/lib/pacemaker/pengine/pe-input-241.bz2
2014-07-09T10:52:34.040477+02:00 xen01 pengine[3445]: notice:
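When a start operation fails with "unknown error (1)" like this, running the agent by hand on the failing node usually surfaces the underlying xl/xm error directly. A rough sketch (the xmfile parameter is taken from the thread; OCF_ROOT is the usual default install location and is an assumption for this environment):

  export OCF_ROOT=/usr/lib/ocf
  export OCF_RESKEY_xmfile=/root/xen_storage/dns_dhcp/dns_dhcp.xen
  /usr/lib/ocf/resource.d/heartbeat/Xen start;   echo "exit code: $?"
  /usr/lib/ocf/resource.d/heartbeat/Xen monitor; echo "exit code: $?"

Whatever the agent prints to stderr here is what lrmd would otherwise only show truncated in the logs.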
[Pacemaker] strange error
Hi,

just wanted to ask, maybe someone has encountered this situation. Suddenly the cluster fails:

Jul 9 04:17:58 sdcsispprxfe1 IPaddr2(extVip51)[17292]: ERROR: Unknown interface [eth1] No such device.
Jul 9 04:17:58 sdcsispprxfe1 IPaddr2(extVip51)[17292]: ERROR: [findif] failed
Jul 9 04:17:58 sdcsispprxfe1 crmd[2116]: notice: process_lrm_event: LRM operation extVip51_monitor_2 (call=57, rc=6, cib-update=2151, confirmed=false) not configured
Jul 9 04:17:58 sdcsispprxfe1 crmd[2116]: warning: update_failcount: Updating failcount for extVip51 on sdcsispprxfe1 after failed monitor: rc=6 (update=value++, time=1404868678)
Jul 9 04:17:58 sdcsispprxfe1 crmd[2116]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul 9 04:17:58 sdcsispprxfe1 attrd[2114]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-extVip51 (1)
Jul 9 04:17:58 sdcsispprxfe1 pengine[2115]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jul 9 04:17:58 sdcsispprxfe1 attrd[2114]: notice: attrd_perform_update: Sent update 42: fail-count-extVip51=1
Jul 9 04:17:58 sdcsispprxfe1 attrd[2114]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-extVip51 (1404868678)
Jul 9 04:17:58 sdcsispprxfe1 pengine[2115]: error: unpack_rsc_op: Preventing extVip51 from re-starting anywhere in the cluster : operation monitor failed 'not configured' (rc=6)
Jul 9 04:17:58 sdcsispprxfe1 pengine[2115]: warning: unpack_rsc_op: Processing failed op monitor for extVip51 on sdcsispprxfe1: not configured (6)

A restart was issued and then:

IPaddr2(extVip51)[23854]: INFO: Bringing device eth1 up

Version: 1.1.10-14.el6_5.3-368c726, CentOS 6.5. (Other logs don't show eth1 going down or anything similar.)
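When findif reports the interface as missing while nothing else in the logs shows it going down, it helps to capture the interface state on the affected node at the moment the monitor fails. A couple of quick checks (standard iproute2/NetworkManager commands, not specific to this setup):

  ip -o link show eth1               # does the kernel still know the device?
  ip -o addr show dev eth1           # does it still carry the expected addresses?
  nmcli dev status 2>/dev/null || true   # only relevant if NetworkManager manages eth1 and nmcli is installed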
[Pacemaker] pacemaker stonith No such device
Dear all,

unfortunately STONITH does not work on my pacemaker cluster. If I do ifdown on the two cluster interconnect interfaces of server sv2827, server sv2828 wants to fence sv2827, but the messages log says:

error: remote_op_done: Operation reboot of sv2827-p1 by sv2828-p1 for crmd.7979@sv2828-p1.076062f0: No such device

Can somebody please help me?

Jul 9 12:42:49 sv2828 corosync[7749]: [CMAN ] quorum lost, blocking activity
Jul 9 12:42:49 sv2828 corosync[7749]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 9 12:42:49 sv2828 corosync[7749]: [QUORUM] Members[1]: 1
Jul 9 12:42:49 sv2828 corosync[7749]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jul 9 12:42:49 sv2828 crmd[7979]: notice: cman_event_callback: Membership 1492: quorum lost
Jul 9 12:42:49 sv2828 crmd[7979]: notice: crm_update_peer_state: cman_event_callback: Node sv2827-p1[2] - state is now lost (was member)
Jul 9 12:42:49 sv2828 crmd[7979]: warning: match_down_event: No match for shutdown action on sv2827-p1
Jul 9 12:42:49 sv2828 crmd[7979]: notice: peer_update_callback: Stonith/shutdown of sv2827-p1 not matched
Jul 9 12:42:49 sv2828 kernel: dlm: closing connection to node 2
Jul 9 12:42:49 sv2828 corosync[7749]: [CPG ] chosen downlist: sender r(0) ip(192.168.2.28) r(1) ip(192.168.3.28) ; members(old:2 left:1)
Jul 9 12:42:49 sv2828 crmd[7979]: notice: do_state_transition: State transition S_IDLE - S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
Jul 9 12:42:49 sv2828 corosync[7749]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 9 12:42:49 sv2828 crmd[7979]: warning: match_down_event: No match for shutdown action on sv2827-p1
Jul 9 12:42:49 sv2828 crmd[7979]: notice: peer_update_callback: Stonith/shutdown of sv2827-p1 not matched
Jul 9 12:42:49 sv2828 attrd[7977]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jul 9 12:42:49 sv2828 attrd[7977]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-MYSQLFS (1404902183)
Jul 9 12:42:49 sv2828 attrd[7977]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jul 9 12:42:49 sv2828 attrd[7977]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-MYSQL (1404901921)
Jul 9 12:42:49 sv2828 pengine[7978]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jul 9 12:42:49 sv2828 pengine[7978]: warning: pe_fence_node: Node sv2827-p1 will be fenced because the node is no longer part of the cluster
Jul 9 12:42:49 sv2828 pengine[7978]: warning: determine_online_status: Node sv2827-p1 is unclean
Jul 9 12:42:49 sv2828 pengine[7978]: warning: custom_action: Action ipmi-fencing-sv2828_stop_0 on sv2827-p1 is unrunnable (offline)
Jul 9 12:42:49 sv2828 pengine[7978]: warning: stage6: Scheduling Node sv2827-p1 for STONITH
Jul 9 12:42:49 sv2828 pengine[7978]: notice: LogActions: Move ipmi-fencing-sv2828#011(Started sv2827-p1 - sv2828-p1)
Jul 9 12:42:49 sv2828 pengine[7978]: warning: process_pe_message: Calculated Transition 38: /var/lib/pacemaker/pengine/pe-warn-28.bz2
Jul 9 12:42:49 sv2828 crmd[7979]: notice: te_fence_node: Executing reboot fencing operation (20) on sv2827-p1 (timeout=6)
Jul 9 12:42:49 sv2828 stonith-ng[7975]: notice: handle_request: Client crmd.7979.6c35e3f1 wants to fence (reboot) 'sv2827-p1' with device '(any)'
Jul 9 12:42:49 sv2828 stonith-ng[7975]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for sv2827-p1: 076062f0-eff3-4798-a504-16c5c5666a5b (0)
Jul 9 12:42:49 sv2828 stonith-ng[7975]: notice: can_fence_host_with_device: ipmi-fencing-sv2827 can not fence sv2827-p1: static-list
Jul 9 12:42:49 sv2828 stonith-ng[7975]: notice: can_fence_host_with_device: ipmi-fencing-sv2828 can not fence sv2827-p1: static-list
Jul 9 12:42:49 sv2828 stonith-ng[7975]: error: remote_op_done: Operation reboot of sv2827-p1 by sv2828-p1 for crmd.7979@sv2828-p1.076062f0: No such device
Jul 9 12:42:49 sv2828 crmd[7979]: notice: tengine_stonith_callback: Stonith operation 8/20:38:0:71703806-8a7c-447f-a033-e3c26abd607c: No such device (-19)
Jul 9 12:42:49 sv2828 crmd[7979]: notice: run_graph: Transition 38 (Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-28.bz2): Stopped

With ipmitool I could verify that IPMI itself works correctly, e.g. a power cycle.

pcs status

Cluster name: mysql-int-prod
Last updated: Wed Jul 9 12:46:43 2014
Last change: Wed Jul 9 12:41:14 2014 via crm_resource on sv2828-p1
Stack: cman
Current DC: sv2828-p1 - partition with quorum
Version: 1.1.10-1.el6_4.4-368c726
2 Nodes configured
5 Resources configured

Online: [ sv2827-p1 sv2828-p1 ]

Full list of resources:

ipmi-fencing-sv2827(stonith:fence_ipmilan):
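The "can not fence sv2827-p1: static-list" lines are the key: neither fence device advertises sv2827-p1 as a target, which is what stonith-ng then reports as "No such device". A sketch of how that is usually checked and corrected (device names are from the log; the pcmk_host_list value is an assumption about the intended mapping, and option names should be confirmed against the local stonith_admin --help):

  stonith_admin -l sv2827-p1             # which devices claim to be able to fence that node name?
  pcs stonith show ipmi-fencing-sv2827   # inspect pcmk_host_list / pcmk_host_check
  # if the node name is missing from the target list, add it, e.g.:
  pcs stonith update ipmi-fencing-sv2827 pcmk_host_list="sv2827-p1"

The name in pcmk_host_list must match the cluster node name exactly (sv2827-p1 here), not the bare hostname or the IPMI address.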
[Pacemaker] CMAN and Pacemaker with IPv6
Dear All,

I have implemented HA on dual-stack servers. At first I had not yet deployed the IPv6 records in DNS, and CMAN and Pacemaker worked normally. But after I created the records on the DNS server, I found that CMAN can no longer start.

Do CMAN and Pacemaker support IPv6?

Regards,
T. Kittiratanachai (Te)
Re: [Pacemaker] Creating a safe cluster-node shutdown script (for when UPS goes OnBattery+LowBattery)
On Tue, Jul 8, 2014, at 02:59, Andrew Beekhof wrote:
> On 4 Jul 2014, at 3:16 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com wrote:
>> Hi all,
>> I'm trying to create a script as per subject (on CentOS 6.5, CMAN+Pacemaker, only DRBD+KVM active/passive resources; SNMP-UPS monitored by NUT).
>> Ideally I think that each node should stop (disable) all locally-running VirtualDomain resources (doing so cleanly demotes and then downs the DRBD resources underneath), then put itself in standby and finally shut down.
>
> Since the end goal is shutdown, why not just run 'pcs cluster stop'?

I thought that this action would cause a communication interruption (since Corosync would stop responding to the peer) and so cause the other node to STONITH us. I know that ideally the other node too should perform "pcs cluster stop" shortly afterwards, since the same UPS powers both, but I worry about timing issues (and races) in UPS monitoring, since it is a large enterprise UPS monitored by SNMP.

Furthermore, I do not know what happens to running resources at "pcs cluster stop": I infer from your suggestion that resources are brought down and not migrated to the other node, correct?

> Possibly with 'pcs cluster standby' first if you're worried that stopping the resources might take too long.

I thought that "pcs cluster standby" would usually migrate the resources to the other node (I actually tried it and confirmed the expected behaviour); this would risk becoming a race with the timing of the other node's standby, which is why I took the trouble of explicitly and orderly stopping all locally-running resources in my script BEFORE putting the local node in standby.

> Pacemaker will stop everything in the required order and stop the node when done... problem solved?

I thought that after a "pcs cluster standby" a regular "shutdown -h" of the operating system would cleanly bring down the cluster too, without the need for a "pcs cluster stop", given that both Pacemaker and CMAN are correctly configured for automatic startup/shutdown as operating system services (SysV initscripts controlled by CentOS 6.5 Upstart, in my case).

Many thanks again for your always thought-provoking and informative answers!

Regards,
Giuseppe

>> On further startup, manual intervention would be required to unstandby all nodes and enable resources (nodes already in standby and resources already disabled before the blackout should be manually distinguished).
>> Is this strategy conceptually safe? Unfortunately, various searches have turned up no prior art :)
>> This is my tentative script (consider it in the public domain):
>>
>> #!/bin/bash
>> # Note: "pcs cluster status" still has a small bug vs. CMAN-controlled Corosync and would always return != 0
>> pcs status >/dev/null 2>&1
>> STATUS=$?
>> # Detect whether the cluster is running at all on the local node
>> # TODO: detect node already in standby and bypass this
>> if [ "${STATUS}" = 0 ]; then
>>     local_node=$(cman_tool status | grep -i 'Node[[:space:]]*name:' | sed -e 's/^.*Node\s*name:\s*\([^[:space:]]*\).*$/\1/i')
>>     for local_resource in $(pcs status 2>/dev/null | grep "ocf::heartbeat:VirtualDomain.*${local_node}\\s*\$" | awk '{print $1}'); do
>>         pcs resource disable "${local_resource}"
>>     done
>>     # TODO: each resource disable above may return before the stop completes - wait here until no more resources are active? (but avoid endless loops)
>>     pcs cluster standby "${local_node}"
>> fi
>> # Shut down gracefully anyway at the end
>> /sbin/shutdown -h +0
>>
>> Comments/suggestions/improvements are more than welcome.
>> Many thanks in advance.
>> Regards,
>> Giuseppe

--
Giuseppe Ragusa
giuseppe.rag...@fastmail.fm
Re: [Pacemaker] Pacemaker 1.1: cloned stonith resources require --force to be added to levels
On Tue, Jul 8, 2014, at 06:06, Andrew Beekhof wrote:
> On 5 Jul 2014, at 1:00 am, Giuseppe Ragusa giuseppe.rag...@hotmail.com wrote:
>> From: and...@beekhof.net
>> Date: Fri, 4 Jul 2014 22:50:28 +1000
>> To: pacemaker@oss.clusterlabs.org
>> Subject: Re: [Pacemaker] Pacemaker 1.1: cloned stonith resources require --force to be added to levels
>>
>>> On 4 Jul 2014, at 1:29 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com wrote:
>>>> Hi all,
>>>> while creating a cloned stonith resource
>>>
>>> Any particular reason you feel the need to clone it?
>>
>> In the end, I suppose it's only a purist mindset :) because it is a PDU whose power outlets control both nodes, so its resource should be active (and monitored) on both nodes independently.
>> I understand that it would work anyway, leaving it not cloned and not location-constrained, just as regular, dedicated stonith devices would not need to be location-constrained, right?
>>
>>>> for multi-level STONITH on a fully-up-to-date CentOS 6.5 (pacemaker-1.1.10-14.el6_5.3.x86_64):
>>>>
>>>> pcs cluster cib stonith_cfg
>>>> pcs -f stonith_cfg stonith create pdu1 fence_apc action=off \
>>>>     ipaddr=pdu1.verolengo.privatelan login=cluster passwd=test \
>>>>     pcmk_host_map=cluster1.verolengo.privatelan:3,cluster1.verolengo.privatelan:4,cluster2.verolengo.privatelan:6,cluster2.verolengo.privatelan:7 \
>>>>     pcmk_host_check=static-list pcmk_host_list=cluster1.verolengo.privatelan,cluster2.verolengo.privatelan op monitor interval=240s
>>>> pcs -f stonith_cfg resource clone pdu1 pdu1Clone
>>>> pcs -f stonith_cfg stonith level add 2 cluster1.verolengo.privatelan pdu1Clone
>>>> pcs -f stonith_cfg stonith level add 2 cluster2.verolengo.privatelan pdu1Clone
>>>>
>>>> the last 2 lines do not succeed unless I add the option --force, and even so I still get errors when issuing verify:
>>>>
>>>> [root@cluster1 ~]# pcs stonith level verify
>>>> Error: pdu1Clone is not a stonith id
>>>
>>> If you check, I think you'll find there is no such resource as 'pdu1Clone'. I don't believe pcs lets you decide what the clone name is.
>>
>> You're right! (obviously ;) It's been automatically named pdu1-clone. I suppose that there's still too much crmsh in my memory :)
>> Anyway, removing the stonith level (to start from scratch) and using the correct clone name does not change the result:
>>
>> [root@cluster1 etc]# pcs -f stonith_cfg stonith level add 2 cluster1.verolengo.privatelan pdu1-clone
>> Error: pdu1-clone is not a stonith id (use --force to override)
>
> I bet we didn't think of that. What if you just do:
>
> pcs -f stonith_cfg stonith level add 2 cluster1.verolengo.privatelan pdu1
>
> Does that work?

Yes, no errors at all and verify successful.

This initially passed by as a simple check for general sanity, while now, on second read, I think you were suggesting that I could clone as usual, then configure the level with the primitive resource (which I usually avoid when working with regular clones), and it should automatically use the clone at runtime, correct?

Remember that a full real test (to verify actual second-level functionality in the presence of a first-level failure) is still pending for both the plain and the cloned setup.

>> Apropos: I read through the list archives that stonith resources (being resources, after all) could themselves cause fencing (!) if failing (start, monitor, stop)
>
> stop just unsets a flag in stonithd. start does perform a monitor op though, which could fail. but by default only stop failure would result in fencing.

I thought that start-failure-is-fatal was true by default, but maybe not for stonith resources.

>> and that an ad-hoc on-fail setting could be used to prevent that. Maybe my aforementioned naive testing procedure (pull the iLO cable) could provoke that?
>
> _shouldn't_ do so
>
>> Would you suggest to configure such an on-fail option?
>
> again, shouldn't be necessary

Thanks again.

Regards,
Giuseppe

>> Many thanks again for your help (and all your valuable work, of course!).
>> Regards,
>> Giuseppe

--
Giuseppe Ragusa
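For readers following along, the sequence that verified cleanly in this exchange uses the primitive stonith id rather than the clone; roughly (hostnames as in the thread, and the final cib-push is the usual way to activate a CIB file built with -f, shown here as an assumption about how the poster applies it):

  pcs cluster cib stonith_cfg
  pcs -f stonith_cfg stonith level add 2 cluster1.verolengo.privatelan pdu1
  pcs -f stonith_cfg stonith level add 2 cluster2.verolengo.privatelan pdu1
  pcs cluster cib-push stonith_cfg
  pcs stonith level verify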
Re: [Pacemaker] Pacemaker 1.1: cloned stonith resources require --force to be added to levels
On 9 Jul 2014, at 10:43 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com wrote:
> [...] I think you were suggesting that I could clone as usual, then configure the level with the primitive resource (which I usually avoid when working with regular clones), and it should automatically use the clone at runtime, correct?

right. but also consider not cloning it at all :)

> I thought that start-failure-is-fatal was true by default, but maybe not for stonith resources.

fatal in the sense of "won't attempt to run it there again", not in the "fence the whole node" way.
Re: [Pacemaker] Creating a safe cluster-node shutdown script (for when UPS goes OnBattery+LowBattery)
On 9 Jul 2014, at 10:28 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com wrote:
>> Since the end goal is shutdown, why not just run 'pcs cluster stop'?
>
> I thought that this action would cause a communication interruption (since Corosync would stop responding to the peer) and so cause the other node to STONITH us;

No. Shutdown is a globally co-ordinated process. We don't fence nodes we know shut down cleanly.

> Furthermore, I do not know what happens to running resources at "pcs cluster stop": I infer from your suggestion that resources are brought down and not migrated to the other node, correct?

If the other node is shutting down too, they'll simply be stopped. Otherwise we'll try to move them.

>> Possibly with 'pcs cluster standby' first if you're worried that stopping the resources might take too long.
>
> I thought that "pcs cluster standby" would usually migrate the resources to the other node (I actually tried it and confirmed the expected behaviour); this would risk becoming a race with the timing of the other node's standby,

Not really: at the point the second node runs 'standby' we'll stop trying to migrate services and just stop them everywhere. Again, this is a centrally controlled process; timing isn't a problem.

> I thought that after a "pcs cluster standby" a regular "shutdown -h" of the operating system would cleanly bring down the cluster too,

It should do.

> without the need for a "pcs cluster stop", given that both Pacemaker and CMAN are correctly configured for automatic startup/shutdown as operating system services (SysV initscripts controlled by CentOS 6.5 Upstart, in my case).
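Putting these points together, the UPS handler can collapse to something quite small. A minimal sketch, assuming CMAN and Pacemaker run as system services and that each node's UPS/NUT hook simply calls this script:

  #!/bin/bash
  # Let Pacemaker stop all resources in the required order on this node,
  # then power off. A node the cluster knows is shutting down cleanly is
  # not fenced by its peer.
  pcs cluster stop
  /sbin/shutdown -h +0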
Re: [Pacemaker] CMAN and Pacemaker with IPv6
On 9 Jul 2014, at 9:15 pm, Teerapatr Kittiratanachai maillist...@gmail.com wrote:
> Dear All,
> I have implemented HA on dual-stack servers. At first I had not yet deployed the IPv6 records in DNS, and CMAN and Pacemaker worked normally. But after I created the records on the DNS server, I found that CMAN can no longer start.
> Do CMAN and Pacemaker support IPv6?

I don't think pacemaker cares. What errors did you get?
Re: [Pacemaker] Cannot create more than 27 multistate resources
On 9 Jul 2014, at 6:49 pm, K Mehta kiranmehta1...@gmail.com wrote:

Hi,

[root@vsanqa11 ~]# rpm -qa | grep pcs ; rpm -qa | grep pace ; rpm -qa | grep libqb; rpm -qa | grep coro; rpm -qa | grep cman
pcs-0.9.90-2.el6.centos.2.noarch
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
libqb-devel-0.16.0-2.el6.x86_64
libqb-0.16.0-2.el6.x86_64
corosynclib-1.4.1-17.el6_5.1.x86_64
corosync-1.4.1-17.el6_5.1.x86_64
cman-3.0.12.1-59.el6_5.2.x86_64

[root@vsanqa11 ~]# uname -a
Linux vsanqa11 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

[root@vsanqa11 ~]# cat /etc/redhat-release
CentOS release 6.3 (Final)

Created 27 resources.

[root@vsanqa11 ~]# pcs status
Cluster name: vsanqa11_12
Last updated: Wed Jul 9 01:30:05 2014
Last change: Wed Jul 9 01:24:12 2014 via cibadmin on vsanqa11
Stack: cman
Current DC: vsanqa11 - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
54 Resources configured

Online: [ vsanqa11 vsanqa12 ]

Full list of resources:

 Master/Slave Set: ms-0feba438-0f16-4de5-bf18-9d576cf4dd26 [vha-0feba438-0f16-4de5-bf18-9d576cf4dd26]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-329ad1bd-2f9c-483d-a052-270731aefd70 [vha-329ad1bd-2f9c-483d-a052-270731aefd70]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-b5b9e8dc-87c9-4229-b425-870d1bc1f107 [vha-b5b9e8dc-87c9-4229-b425-870d1bc1f107]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-56112cc8-4c56-4454-84ea-ba9b7b82dde7 [vha-56112cc8-4c56-4454-84ea-ba9b7b82dde7]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-20fd5ed6-8e72-4a6b-9b1a-73bd96e1f253 [vha-20fd5ed6-8e72-4a6b-9b1a-73bd96e1f253]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-77af4d7e-d78c-4799-a24a-7536d225 [vha-77af4d7e-d78c-4799-a24a-7536d225]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-913f6e7e-932f-4c3f-9e7f-a70c4153b4c7 [vha-913f6e7e-932f-4c3f-9e7f-a70c4153b4c7]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-3d3ae48c-1955-4c64-b951-f2a2621a70b7 [vha-3d3ae48c-1955-4c64-b951-f2a2621a70b7]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-a8101da1-636f-483b-90a9-9a18fd5a5793 [vha-a8101da1-636f-483b-90a9-9a18fd5a5793]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-290e0fdd-19c6-4b40-8046-0fa2c94a7320 [vha-290e0fdd-19c6-4b40-8046-0fa2c94a7320]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-8341368b-6b6a-4bf0-bb0a-38b32ffec2f4 [vha-8341368b-6b6a-4bf0-bb0a-38b32ffec2f4]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-6695f479-0d21-4388-9540-a440c32e0944 [vha-6695f479-0d21-4388-9540-a440c32e0944]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-85fe517e-9a2b-4ac1-b6e1-e4da57b49969 [vha-85fe517e-9a2b-4ac1-b6e1-e4da57b49969]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-5b9adf5c-3bb5-4353-8c5f-a3e1508f7c4a [vha-5b9adf5c-3bb5-4353-8c5f-a3e1508f7c4a]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-57b4b7a1-d3bb-4b94-b8dc-8a84e4a3de03 [vha-57b4b7a1-d3bb-4b94-b8dc-8a84e4a3de03]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-b4fbac14-ef19-4861-93e5-14574e101484 [vha-b4fbac14-ef19-4861-93e5-14574e101484]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-ec4836d3-b768-4abd-83bf-3a61143346ce [vha-ec4836d3-b768-4abd-83bf-3a61143346ce]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-28d6db1a-bdba-4fc0-b74c-ef74548cf714 [vha-28d6db1a-bdba-4fc0-b74c-ef74548cf714]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-c6cfd538-17f4-4c4b-b259-731e2cac75f3 [vha-c6cfd538-17f4-4c4b-b259-731e2cac75f3]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-06d19988-7549-4882-b780-fd598714ec7f [vha-06d19988-7549-4882-b780-fd598714ec7f]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-ec105b52-42b1-42f5-931f-249fbc2f16c4 [vha-ec105b52-42b1-42f5-931f-249fbc2f16c4]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-6cc9a97b-9714-4cab-bf52-103c46b8593f [vha-6cc9a97b-9714-4cab-bf52-103c46b8593f]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-b80b4a69-67e2-43bd-8eba-f6c18f78d706 [vha-b80b4a69-67e2-43bd-8eba-f6c18f78d706]
     Masters: [ vsanqa12 ]
     Slaves: [ vsanqa11 ]
 Master/Slave Set: ms-24817d3c-98e0-4408-9bce-67d8bb2495db [vha-24817d3c-98e0-4408-9bce-67d8bb2495db]
     Masters: [ vsanqa12 ]
     Slaves:
Re: [Pacemaker] strange error
Is NetworkManager present? Using DHCP for that interface?

On 9 Jul 2014, at 7:03 pm, divinesecret arvy...@artogama.lt wrote:
> Hi, just wanted to ask, maybe someone has encountered this situation. Suddenly the cluster fails:
> Jul 9 04:17:58 sdcsispprxfe1 IPaddr2(extVip51)[17292]: ERROR: Unknown interface [eth1] No such device.
> Jul 9 04:17:58 sdcsispprxfe1 IPaddr2(extVip51)[17292]: ERROR: [findif] failed
> [...]
> (other logs don't show eth1 going down or anything similar)
Re: [Pacemaker] CMAN and Pacemaker with IPv6
I did not find any relevant log messages. /var/log/messages shows only:

...
Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01) installed
Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker Cluster Manager
...

and this is what is displayed when I try to start pacemaker:

# /etc/init.d/pacemaker start
Starting cluster:
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...
Cannot find node name in cluster.conf
Unable to get the configuration
Cannot find node name in cluster.conf
cman_tool: corosync daemon didn't start Check cluster logs for details
                                                           [FAILED]
Stopping cluster:
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]
Aborting startup of Pacemaker Cluster Manager

One more thing: because of this problem, I have removed the records from DNS for now and mapped the names in /etc/hosts instead, as shown below.

/etc/hosts
...
2001:db8:0:1::1 node0.example.com
2001:db8:0:1::2 node1.example.com
...

Is there any configuration that would help me get more logs?

On Thu, Jul 10, 2014 at 5:06 AM, Andrew Beekhof and...@beekhof.net wrote:
> I don't think pacemaker cares. What errors did you get?
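The "Cannot find node name in cluster.conf" failure usually means the names under the clusternode entries no longer match what the node resolves for itself after the addressing change. A few checks worth running (the cluster.conf snippet is illustrative of the expected shape, not the poster's actual file):

  uname -n                                   # the name cman looks for in cluster.conf
  grep 'clusternode name' /etc/cluster/cluster.conf
  #   e.g.  <clusternode name="node0.example.com" nodeid="1">
  getent ahosts node0.example.com            # should now resolve via /etc/hosts
  ccs_config_validate                        # sanity-check cluster.conf syntax

If uname -n returns a short hostname while cluster.conf uses the FQDN (or vice versa), cman will report exactly this error.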
Re: [Pacemaker] Pacemaker 1.1: cloned stonith resources require --force to be added to levels
On Thu, Jul 10, 2014, at 00:00, Andrew Beekhof wrote:
> On 9 Jul 2014, at 10:43 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com wrote:
>> [...] I think you were suggesting that I could clone as usual, then configure the level with the primitive resource (which I usually avoid when working with regular clones), and it should automatically use the clone at runtime, correct?
>
> right. but also consider not cloning it at all :)

I understand that in your opinion there's almost no added value in cloned stonith resources, so I suppose that, should a PDU-type stonith resource happen to be running on the same node that it must now fence, it would be migrated first or something like that (since I understand that stonith resources cannot fence the node they are running on), right?

If that is so and there's no adverse effect whatsoever (not even a significant delay), I will promptly remove the clone and configure my second levels using the primitive PDU stonith resource; but if, on the contrary, you think after all that there could be some legitimate use for such clones, I could open an RFE in Bugzilla for them to be recognized as stonith ids and used in forming levels (if you suggest so).

Anyway, many thanks for your advice and insight, obviously :)

>> I thought that start-failure-is-fatal was true by default, but maybe not for stonith resources.
>
> fatal in the sense of "won't attempt to run it there again", not in the "fence the whole node" way

Ah right, I remember now all the suggestions I found about migration-threshold,
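If the clone is dropped as discussed, the change is small; a sketch using the resource names from this thread (assuming the installed pcs supports "resource unclone"; otherwise the resource can be removed and recreated unclonded):

  pcs -f stonith_cfg resource unclone pdu1
  pcs -f stonith_cfg stonith level verify     # the levels already reference the primitive id pdu1
  pcs cluster cib-push stonith_cfg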
Re: [Pacemaker] Creating a safe cluster-node shutdown script (for when UPS goes OnBattery+LowBattery)
On Thu, Jul 10, 2014, at 00:06, Andrew Beekhof wrote:

On 9 Jul 2014, at 10:28 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com wrote:

On Tue, Jul 8, 2014, at 02:59, Andrew Beekhof wrote:

On 4 Jul 2014, at 3:16 pm, Giuseppe Ragusa giuseppe.rag...@hotmail.com wrote:

Hi all, I'm trying to create a script as per subject (on CentOS 6.5, CMAN+Pacemaker, only DRBD+KVM active/passive resources; SNMP-UPS monitored by NUT). Ideally I think that each node should stop (disable) all locally-running VirtualDomain resources (doing so cleanly demotes and then downs the DRBD resources underneath), then put itself in standby and finally shut down.

Since the end goal is shutdown, why not just run 'pcs cluster stop' ?

I thought that this action would cause a communication interruption (since Corosync would no longer respond to the peer) and so cause the other node to stonith us;

No. Shutdown is a globally co-ordinated process. We don't fence nodes we know shut down cleanly.

Thanks for the clarification. Now that you say it, it seems logical and even obvious ;)

I know that ideally the other node too should perform pcs cluster stop shortly afterwards, since the same UPS powers both, but I worry about timing issues (and races) in UPS monitoring, since it is a large enterprise UPS monitored by SNMP. Furthermore, I do not know what happens to running resources at pcs cluster stop: I infer from your suggestion that resources are brought down and not migrated to the other node, correct?

If the other node is shutting down too, they'll simply be stopped. Otherwise we'll try to move them.

It's the moving that worries me :)

Possibly with 'pcs cluster standby' first if you're worried that stopping the resources might take too long.

I forgot to ask: in what way would a previous standby make the resources stop sooner? I thought that pcs cluster standby would usually migrate the resources to the other node (I actually tried it and confirmed the expected behaviour); so this would risk becoming a race with the timing of the other node's standby,

Not really, at the point the second node runs 'standby' we'll stop trying to migrate services and just stop them everywhere. Again, this is a centrally controlled process, timing isn't a problem.

I understand that, eventually, timing won't be a problem and resources will eventually stop, but from your description I'm afraid that some delay could creep into the total shutdown process, arising from possibly unsynchronized UPS notifications on the nodes (first node starts standby, resources start to move, THEN second node starts standby). So now I'm taking your advice and I'll modify the script to use cluster stop, but, with the aim of avoiding the aforementioned delay (if it actually represents a possibility), I would like to ask you three questions:

*) If I simply issue a pcs cluster stop --all from the first node that gets notified of UPS critical status, do I risk any adverse effect when the other node asynchronously gives the same command some time later (before/after the whole cluster stop sequence completes)?

*) Does the aforementioned pcs cluster stop --all command return only after the cluster stop sequence has actually/completely ended (so as to safely issue a shutdown -h now immediately afterwards)?

*) Is the pcs cluster stop --all command known to work reliably on current CentOS 6.5? (I ask since I found some discussion around pcs cluster start related bugs)

Many thanks again for your invaluable help and insight.
Regards, Giuseppe

so this is why I took the hassle of explicitly and orderly stopping all locally-running resources in my script BEFORE putting the local node in standby.

Pacemaker will stop everything in the required order and stop the node when done... problem solved?

I thought that after a pcs cluster standby a regular shutdown -h of the operating system would cleanly bring down the cluster too,

It should do

without the need for a pcs cluster stop, given that both Pacemaker and CMAN are correctly configured for automatic startup/shutdown as operating system services (SysV initscripts controlled by CentOS 6.5 Upstart, in my case).

Many thanks again for your always thought-provoking and informative answers!

Regards, Giuseppe

On further startup, manual intervention would be required to unstandby all nodes and re-enable resources (nodes already in standby and resources already disabled before the blackout would have to be distinguished manually). Is this strategy conceptually safe? Unfortunately, various searches have turned up no prior art :)

This is my tentative script (consider it in the public domain):

#!/bin/bash
# Note: pcs cluster status still
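A minimal sketch of the simpler approach suggested above (clean local cluster stop followed by OS power-off), rather than the explicit per-resource stop in the truncated script. It assumes the script is invoked by NUT on the OnBattery+LowBattery event; the script name, logger tag and NUT wiring are illustrative assumptions, not part of the original posts:

#!/bin/bash
# ups-shutdown.sh -- run on each node when the UPS reports OnBattery+LowBattery
# (e.g. hooked into upsmon via SHUTDOWNCMD or an upssched action; assumption)

logger -t ups-shutdown "UPS critical: stopping the cluster stack on this node"

# Stop Pacemaker and CMAN locally. Per the discussion above, a clean shutdown
# is coordinated cluster-wide, so peers stopping at the same time are not fenced.
pcs cluster stop

# Power the node off once the cluster stack has stopped.
shutdown -h now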
Re: [Pacemaker] CMAN and Pacemaker with IPv6
OK, some problems are solved. I had used an incorrect hostname. Now a new problem has occurred:

Starting cman...
Node address family does not match multicast address family
Unable to get the configuration
Node address family does not match multicast address family
cman_tool: corosync daemon didn't start
Check cluster logs for details   [FAILED]

How can I fix it? Or should I just assign the multicast address in the configuration?

Regards, Te

On Thu, Jul 10, 2014 at 7:52 AM, Teerapatr Kittiratanachai maillist...@gmail.com wrote:

I did not find any log message in /var/log/messages ...

Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01) installed
Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker Cluster Manager

... and this is what is displayed when I try to start pacemaker:

# /etc/init.d/pacemaker start
Starting cluster:
  Checking if cluster has been disabled at boot...   [  OK  ]
  Checking Network Manager...                        [  OK  ]
  Global setup...                                    [  OK  ]
  Loading kernel modules...                          [  OK  ]
  Mounting configfs...                               [  OK  ]
  Starting cman...
  Cannot find node name in cluster.conf
  Unable to get the configuration
  Cannot find node name in cluster.conf
  cman_tool: corosync daemon didn't start
  Check cluster logs for details                     [FAILED]
Stopping cluster:
  Leaving fence domain...                            [  OK  ]
  Stopping gfs_controld...                           [  OK  ]
  Stopping dlm_controld...                           [  OK  ]
  Stopping fenced...                                 [  OK  ]
  Stopping cman...                                   [  OK  ]
  Unloading kernel modules...                        [  OK  ]
  Unmounting configfs...                             [  OK  ]
Aborting startup of Pacemaker Cluster Manager

One other thing: because of this problem, I have removed the record from DNS for now and mapped it in the /etc/hosts file instead, as shown below.

/etc/hosts
...
2001:db8:0:1::1 node0.example.com
2001:db8:0:1::2 node1.example.com
...

Is there any configuration that would help me get more logs?

On Thu, Jul 10, 2014 at 5:06 AM, Andrew Beekhof and...@beekhof.net wrote:

On 9 Jul 2014, at 9:15 pm, Teerapatr Kittiratanachai maillist...@gmail.com wrote:

Dear All, I have implemented HA on dual-stack servers. At first I had not yet deployed the IPv6 records in DNS, and CMAN and Pacemaker worked normally. But after I created the records on the DNS server, I found that CMAN cannot start. Do CMAN and Pacemaker support IPv6?

I don't think pacemaker cares. What errors did you get?
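One possible direction, offered only as an assumption rather than a verified fix: the "Node address family does not match multicast address family" message suggests the node names now resolve to IPv6 addresses while cman is still using (or being given) an IPv4 multicast group, so explicitly configuring an IPv6 multicast address in /etc/cluster/cluster.conf may help. In the sketch below the cluster name, nodeids and the ff15::1 group are illustrative; the clusternode names should match what the hosts resolve (e.g. via the /etc/hosts entries above) and what `uname -n` reports on each node, which is typically what the "Cannot find node name in cluster.conf" check compares against.

<cluster name="hacluster" config_version="1">
  <cman>
    <!-- illustrative IPv6 multicast group; must match the nodes' address family -->
    <multicast addr="ff15::1"/>
  </cman>
  <clusternodes>
    <clusternode name="node0.example.com" nodeid="1"/>
    <clusternode name="node1.example.com" nodeid="2"/>
  </clusternodes>
</cluster>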