Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
Hello How do you configure your cluster network? are you using a private network for the cluster and one public for the services? 2013/5/15 Andrew Widdersheim awiddersh...@hotmail.com Sorry to bring up old issues but I am having the exact same problem as the original poster. A simultaneous disconnect on my two node cluster causes the resources to start to transition to the other node but mid flight the transition is aborted and resources are started again on the original node when the cluster realizes connectivity is same between the two nodes. I have tried various dampen settings without having any luck. Seems like the nodes report the outages at slightly different times which results in a partial transition of resources instead of waiting to know the connectivity of all of the nodes in the cluster before taking action which is what I would have thought dampen would help solve. Ideally the cluster wouldn't start the transition if another cluster node is having a connectivity issue as well and connectivity status is shared between all cluster nodes. Find my configuration below. Let me know there is something I can change to fix or if this behavior is expected. primitive p_drbd ocf:linbit:drbd \ params drbd_resource=r1 \ op monitor interval=30s role=Slave \ op monitor interval=10s role=Master primitive p_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd/by-res/r1 directory=/drbd/r1 fstype=ext4 options=noatime \ op start interval=0 timeout=60s \ op stop interval=0 timeout=180s \ op monitor interval=30s timeout=40s primitive p_mysql ocf:heartbeat:mysql \ params binary=/usr/libexec/mysqld config=/drbd/r1/mysql/my.cnf datadir=/drbd/r1/mysql \ op start interval=0 timeout=120s \ op stop interval=0 timeout=120s \ op monitor interval=30s \ meta target-role=Started primitive p_ping ocf:pacemaker:ping \ params host_list=192.168.5.1 dampen=30s multiplier=1000 debug=true \ op start interval=0 timeout=60s \ op stop interval=0 timeout=60s \ op monitor interval=5s timeout=10s group g_mysql_group p_fs p_mysql \ meta target-role=Started ms ms_drbd p_drbd \ meta notify=true master-max=1 clone-max=2 target-role=Started clone cl_ping p_ping location l_connected g_mysql \ rule $id=l_connected-rule pingd: defined pingd colocation c_mysql_on_drbd inf: g_mysql ms_drbd:Master order o_drbd_before_mysql inf: ms_drbd:promote g_mysql:start property $id=cib-bootstrap-options \ dc-version=1.1.6-1.el6-8b6c6b9b6dc2627713f870850d20163fad4cc2a2 \ cluster-infrastructure=Heartbeat \ no-quorum-policy=ignore \ stonith-enabled=false \ cluster-recheck-interval=5m \ last-lrm-refresh=1368632470 rsc_defaults $id=rsc-options \ migration-threshold=5 \ resource-stickiness=200 ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org -- esta es mi vida e me la vivo hasta que dios quiera ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
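For reference, one commonly used shape for the ping definition is several targets, a longer dampen interval and an explicit attribute name, so that a single target outage is reported consistently by all nodes before the policy engine reacts. The values below are purely illustrative and not taken from the configuration above:

    primitive p_ping ocf:pacemaker:ping \
        params name=pingd host_list="192.168.5.1 192.168.5.254" dampen=60s multiplier=1000 \
        op monitor interval=15s timeout=60s
    clone cl_ping p_ping meta interleave=true
    location l_connected g_mysql \
        rule $id=l_connected-rule pingd: defined pingd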
[Pacemaker] pcs/crmsh Cheat sheet
By popular request, I've taken a stab at a cheat-sheet for those switching between pcs and crmsh. https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-crmsh-quick-ref.md Any and all assistance expanding it and ensuring it is accurate will be gratefully received. -- Andrew ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
On 16/05/2013, at 3:49 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 16.05.2013 02:46, Andrew Beekhof wrote: On 15/05/2013, at 6:44 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 15.05.2013 11:18, Andrew Beekhof wrote: On 15/05/2013, at 5:31 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 15.05.2013 10:25, Andrew Beekhof wrote: On 15/05/2013, at 3:50 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 15.05.2013 08:23, Andrew Beekhof wrote: On 15/05/2013, at 3:11 PM, renayama19661...@ybb.ne.jp wrote: Hi Andrew, Thank you for comments. The guest located it to the shared disk. What is on the shared disk? The whole OS or app-specific data (i.e. nothing pacemaker needs directly)? Shared disk has all the OS and the all data. Oh. I can imagine that being problematic. Pacemaker really isn't designed to function without disk access. You might be able to get away with it if you turn off saving PE files to disk though. I store CIB and PE files to tmpfs, and sync them to remote storage (CIFS) with lsyncd level 1 config (I may share it on request). It copies critical data like cib.xml, and moves everything else, symlinking it to original place. The same technique may apply here, but with local fs instead of cifs. Btw, the following patch is needed for that, otherwise pacemaker overwrites remote files instead of creating new ones on tmpfs: --- a/lib/common/xml.c 2011-02-11 11:42:37.0 +0100 +++ b/lib/common/xml.c 2011-02-24 15:07:48.541870829 +0100 @@ -529,6 +529,8 @@ write_file(const char *string, const char *filename) return -1; } +unlink(filename); Seems like it should be safe to include for normal operation. Exactly. Small flaw in that logic... write_file() is not used anywhere. Heh, thanks for spotting this. I recall write_file() was used for pengine, but some other function for CIB. You probably optimized that but forgot to remove unused function, that's why I was sure patch is still valid. And I did tests (CIFS storage outage simulation) only after initial patch, but not last years, that's why I didn't notice the regression - storage uses pacemaker too ;) . This should go to write_xml_file() (And probably to other places just before fopen(..., w), f.e. series). I've consolidated the code, however adding the unlink() would break things for anyone intentionally symlinking cib.xml from somewhere else (like a git repo). So I'm not so sure I should make the unlink() change :( Agree. I originally made it specific to pengine files. What do you prefer, simple wrapper in xml.c (f.e. unlink_and_write_xml_file()) or just add unlink() call to pengine before it calls write_xml_file()? The last one :) ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
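For readers following the code, a minimal sketch of the wrapper being discussed. This is not the actual Pacemaker change, only the shape of the unlink-before-write idea, and it assumes the Pacemaker build environment (libxml2's xmlNode, glib's gboolean and the existing write_xml_file() from lib/common/xml.c):

    #include <unistd.h>
    #include <crm/common/xml.h>

    /* Sketch only: drop any existing file (or symlink) first, so the write
     * creates a fresh inode on the local filesystem instead of following a
     * symlink back to remote storage. An ENOENT from unlink() is harmless. */
    int
    unlink_and_write_xml_file(xmlNode *xml, const char *filename, gboolean compress)
    {
        unlink(filename);
        return write_xml_file(xml, filename, compress);
    }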
Re: [Pacemaker] Pacemaker Digest, Vol 66, Issue 58
Hi Andreas, thanks for your answer, crm_simulate -s -L (node2 is offline - r_postfix is running on node1) native_color: r_haproxy allocation score on node1: -INFINITY native_color: r_haproxy allocation score on node2: -INFINITY crm_simulate -s -L (both nodes are online - r_postfix is running on node1) native_color: r_haproxy allocation score on node1: -INFINITY native_color: r_haproxy allocation score on node2: 0 with 2 colocation and we see that colocation with score 100 is not setting colocation cl_r_haproxy_not_on_r_postfix -inf: r_haproxy r_postfix colocation cl_r_haproxy_on_r_postfix 100: r_haproxy r_postfix I don´t understand because score on node2 is 0 Regards, Wolfgang On 2013-05-15 21:30, Wolfgang Routschka wrote: Hi everybody, one question today about colocation rule on a 2-node cluster on scientific linux 6.4 and pacemaker/cman. 2-Node Cluster first node haproxy load balancer proxy service - second node with postfix service. colocation for running a group called g_ip-address (haproxy lsb-resouce and ipaddress resource) on the other node of the postfix server is cl_g_ip-address_not_on_r_postfix -inf: g_ip-address r_postfix -INF == never-ever ;-) The problem is now that the node with haproxy is down pacemaker cannot move/migrate the services to the other node -ok second colocation with lower score but it doesn?t works for me colocation cl_g_ip-address_on_r_postfix -1: g_ip-address r_postfix Whats my fault in these section? Hard to say without seeing the rest of your configuration, but you can run crm_simulate -s -L to see all the scores taken into account. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now How can I migrate my group to the other if the master node for it is dead? Greetings Wolfgang ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org -- next part -- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 287 bytes Desc: OpenPGP digital signature URL: http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20130515/2a392fa1/attachment.sig -- ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker End of Pacemaker Digest, Vol 66, Issue 58 * ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Stonith: How to avoid deathmatch cluster partitioning
On 2013-05-15T22:55:43, Andreas Kurz andr...@hastexo.com wrote: start-delay is an option of the monitor operation ... in fact means don't trust that start was successfull, wait for the initial monitor some more time It can be used on start here though to avoid exactly this situation; and it works fine for that, effectively being equivalent to the delay option on stonith (since the start always precedes the fence). The problem is, this would only make sense for one single stonith resource that can fence more nodes. In case of a split-brain that would delay the start on that node where the stonith resource was not running before and gives that node a penalty. Sure. In a split-brain scenario, one side will receive a penalty, that's the whole point of this exercise. In particular for the external/sbd agent. Or by grouping all fencing resources to always run on one node; if you don't have access to RHT fence agents, for example. external/sbd also has code to avoid a death-match cycle in case of persistent split-brain scenarios now; after a reboot, the node that was fenced will not join unless the fence is cleared first. (The RHT world calls that unfence, I believe.) That should be a win for the fence_sbd that I hope to get around to sometime in the next few months, too ;-) In your example with two stonith resources running all the time, Digimer's suggestion is a good idea: use one of the redhat fencing agents, most of them have some sort of stonith-delay parameter that you can use with one instance. It'd make sense to have logic for this embedded at a higher level, somehow; the problem is all too common. Of course, it is most relevant in scenarios where split brain is a significantly higher probability than node down. Which is true for most test scenarios (admins love yanking cables), but in practice, it's mostly truly the node down. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. -- Oscar Wilde ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Stonith: How to avoid deathmatch cluster partitioning
Hi Andreas! On 15.05.2013 22:55, Andreas Kurz wrote: On 2013-05-15 15:34, Klaus Darilion wrote: On 15.05.2013 14:51, Digimer wrote: On 05/15/2013 08:37 AM, Klaus Darilion wrote: primitive st-pace1 stonith:external/xen0 \ params hostlist=pace1 dom0=xentest1 \ op start start-delay=15s interval=0 Try; primitive st-pace1 stonith:external/xen0 \ params hostlist=pace1 dom0=xentest1 delay=15 \ op start start-delay=15s interval=0 The idea here is that, when both nodes lose contact and initiate a fence, 'st-pace1' will get a 15 second reprieve. That is, 'st-pace2' will wait 15 seconds before trying to fence 'st-pace1'. If st-pace1 is still alive, it will fence 'st-pace2' without delay, so pace2 will be dead before it's timer expires, preventing a dual-fence. However, if pace1 really is dead, pace2 will fence it and recovery, just with a 15 second delay. Sounds good, but pacemaker does not accept the parameter: ERROR: st-pace1: parameter delay does not exist start-delay is an option of the monitor operation ... in fact means don't trust that start was successfull, wait for the initial monitor some more time The problem is, this would only make sense for one single stonith resource that can fence more nodes. In case of a split-brain that would delay the start on that node where the stonith resource was not running before and gives that node a penalty. Thanks for the clarification. I already thought that the start-delay workaround is not useful in my setup. In your example with two stonith resources running all the time, Digimer's suggestion is a good idea: use one of the redhat fencing agents, most of them have some sort of stonith-delay parameter that you can use with one instance. I found it somehow confusing that a generic parameter (delay is useful for all stonith agents) is implemented in the agent, not in pacemaker. Further, downloading the RH source RPMS and extracting the agents is also quite cumbersome. I think I will add the delay parameter to the relevant fencing agent myself. I guess I also have increase the stonith-timeout and add the configured delay. Do you know how to submit patches for the stonith agents? Thanks Klaus ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
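To illustrate the suggestion, a fence-agent based variant might look roughly like the following; the agent, addresses and credentials are placeholders, and the metadata of whichever agent is actually used should be checked for the exact delay parameter name:

    primitive st-pace1 stonith:fence_virsh \
        params ipaddr=xentest1 login=root identity_file=/root/.ssh/id_rsa \
               port=pace1 pcmk_host_list=pace1 delay=15 \
        op monitor interval=60s
    primitive st-pace2 stonith:fence_virsh \
        params ipaddr=xentest2 login=root identity_file=/root/.ssh/id_rsa \
               port=pace2 pcmk_host_list=pace2 \
        op monitor interval=60s
    location l_st_pace1 st-pace1 -inf: pace1
    location l_st_pace2 st-pace2 -inf: pace2

Only one of the two devices carries delay=15, which is what gives one side of a split brain the head start.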
Re: [Pacemaker] error with cib synchronisation on disk
On 16.05.2013 07:14, Andrew Beekhof wrote: On 15/05/2013, at 9:53 PM, Халезов Иван i.khale...@rts.ru wrote: Hello everyone! Some problems occured with synchronisation CIB configuration to disk. I have this errors in pacemaker's logfile: What were the messages before this? Did it happen once or many times? At startup or while the cluster was running? I had updated cluster configuration before, so there was some output about it in the logfile (not from the beginning here, because it is rather big): May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - primitive id=Security_A May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - meta_attributes id=Security_A-meta_attributes May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - nvpair id=Security_A-meta_attributes-target-role name=target-role value=Stopped __crm_diff_marker__=r emoved:top / May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /meta_attributes May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /primitive May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - primitive id=Security_B May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - meta_attributes id=SPBEX_Security_B-meta_attributes May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - nvpair id=Security_B-meta_attributes-target-role name=target-role value=Started __crm_diff_marker__=removed:top / May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /meta_attributes May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /primitive May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /group May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /resources May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /configuration May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: - /cib May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + cib epoch=496 num_updates=1 admin_epoch=0 validate-with=pacemaker-1.2 cib-last-written=Mon May 13 18:50:25 2013 crm_feature_set=3.0.6 update-origin=iblade6.net.rts update-client=cibadmin have-quorum=1 dc-uuid=2130706433 May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + configuration May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + resources May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + group id=FAST_SENDERS May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + meta_attributes id=FAST_SENDERS-meta_attributes __crm_diff_marker__=added:top May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + nvpair id=FAST_SENDERS-meta_attributes-target-role name=target-role value=Started / May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + /meta_attributes May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + /group May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + /resources May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + /configuration May 14 13:29:13 iblade6 cib[2848]: info: cib:diff: + /cib May 14 13:29:13 iblade6 cib[2848]: info: cib_process_request: Operation complete: op cib_replace for section resources (origin=local/cibadmin/2, version=0.496.1): ok (rc=0) May 14 13:29:13 iblade6 pengine[2852]: notice: LogActions: Start Trades_INCR_A#011(iblade6.net.rts) May 14 13:29:13 iblade6 pengine[2852]: notice: LogActions: Start Trades_INCR_B#011(iblade6.net.rts) May 14 13:29:13 iblade6 pengine[2852]: notice: LogActions: Start Security_A#011(iblade6.net.rts) May 14 13:29:13 iblade6 pengine[2852]: notice: LogActions: Start Security_B#011(iblade6.net.rts) May 14 13:29:13 iblade6 crmd[2853]: notice: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] May 14 13:29:13 
iblade6 crmd[2853]: info: do_te_invoke: Processing graph 41 (ref=pe_calc-dc-1368523753-125) derived from /var/lib/pengine/pe-input-452.bz2 May 14 13:29:13 iblade6 crmd[2853]: info: te_rsc_command: Initiating action 80: start Trades_INCR_A_start_0 on iblade6.net.rts (local) May 14 13:29:13 iblade6 cluster:error: validate_cib_digest: Digest comparision failed: expected 2c91194022c98636f90df9dd5e7176c6 (/var/lib/heartbeat/crm/cib.Zm249H), calculated bc160870924630b3907c8cb1c3128eee May 14 13:29:13 iblade6 cluster:error: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.a024wF failed! Configuration contents ignored! May 14 13:29:13 iblade6 cluster:error: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected May 14 13:29:13 iblade6 cluster:error: crm_abort: write_cib_contents: Triggered fatal assert at io.c:662 : retrieveCib(tmp1, tmp2, FALSE) != NULL May 14 13:29:13 iblade6 pengine[2852]: notice: process_pe_message: Transition 41: PEngine Input stored in: /var/lib/pengine/pe-input-452.bz2 May 14 13:29:13 iblade6 cib[2848]:error: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0 May 14 13:29:13 iblade6 cib[2848]:error:
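If the on-disk copy really was changed outside the cluster (the situation the FAQ entry referenced in the log describes), one common way out is to let the affected node throw away its local copy and re-sync the CIB from the DC. Roughly, and only on the node that logs the digest failure (paths are for a heartbeat-based install; adjust to the stack in use and apply with care):

    /etc/init.d/heartbeat stop       # or stop the pacemaker/corosync stack in use
    rm -f /var/lib/heartbeat/crm/cib*
    /etc/init.d/heartbeat start      # the node pulls the current CIB from the DC on join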
Re: [Pacemaker] pacemaker colocation after one node is down
Hi Andreas, thank you for your answer. solutions is one coloation with -score colocation cl_g_ip-address_not_on_r_postfix -1: g_ip-address r_postfix Greetings Wolfgang On 2013-05-15 21:30, Wolfgang Routschka wrote: Hi everybody, one question today about colocation rule on a 2-node cluster on scientific linux 6.4 and pacemaker/cman. 2-Node Cluster first node haproxy load balancer proxy service - second node with postfix service. colocation for running a group called g_ip-address (haproxy lsb-resouce and ipaddress resource) on the other node of the postfix server is cl_g_ip-address_not_on_r_postfix -inf: g_ip-address r_postfix -INF == never-ever ;-) The problem is now that the node with haproxy is down pacemaker cannot move/migrate the services to the other node -ok second colocation with lower score but it doesn?t works for me colocation cl_g_ip-address_on_r_postfix -1: g_ip-address r_postfix Whats my fault in these section? Hard to say without seeing the rest of your configuration, but you can run crm_simulate -s -L to see all the scores taken into account. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now How can I migrate my group to the other if the master node for it is dead? Greetings Wolfgang ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org -- next part -- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 287 bytes Desc: OpenPGP digital signature URL: http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20130515/2a392fa1/attachment.sig -- ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker End of Pacemaker Digest, Vol 66, Issue 58 * ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] stonith-ng: error: remote_op_done: Operation reboot of node2 by node1 for stonith_admin: Timer expired
Using Pacemaker 1.1.8 on EL6.4 with the pacemaker plugin, I'm finding strange behavior with stonith-admin -B node2. It seems to shut the node down but not start it back up and ends up reporting a timer expired: # stonith_admin -B node2 Command failed: Timer expired The pacemaker log for the operation is: May 16 13:50:41 node1 stonith_admin[23174]: notice: crm_log_args: Invoked: stonith_admin -B node2 May 16 13:50:41 node1 stonith-ng[1673]: notice: handle_request: Client stonith_admin.23174.4a093de2 wants to fence (reboot) 'node2' with device '(any)' May 16 13:50:41 node1 stonith-ng[1673]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node2: aa230634-6a38-42b7-8ed4-0a0eb64af39a (0) May 16 13:50:41 node1 cibadmin[23176]: notice: crm_log_args: Invoked: cibadmin --query May 16 13:50:49 node1 corosync[1376]: [TOTEM ] A processor failed, forming new configuration. May 16 13:50:55 node1 corosync[1376]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 76: memb=1, new=0, lost=1 May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: pcmk_peer_update: memb: node1 4252674240 May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: pcmk_peer_update: lost: node2 2608507072 May 16 13:50:55 node1 corosync[1376]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 76: memb=1, new=0, lost=0 May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240 May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node2 was not seen in the previous transition May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: update_member: Node 2608507072/node2 is now: lost May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: send_member_notification: Sending membership update 76 to 2 children May 16 13:50:55 node1 corosync[1376]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 16 13:50:55 node1 corosync[1376]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.253) r(1) ip(10.0.0.253) ; members(old:2 left:1) May 16 13:50:55 node1 corosync[1376]: [MAIN ] Completed service synchronization, ready to provide service. 
May 16 13:50:55 node1 cib[1672]: notice: ais_dispatch_message: Membership 76: quorum lost May 16 13:50:55 node1 cib[1672]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now lost May 16 13:50:55 node1 crmd[1677]: notice: ais_dispatch_message: Membership 76: quorum lost May 16 13:50:55 node1 crmd[1677]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now lost May 16 13:50:55 node1 crmd[1677]: warning: match_down_event: No match for shutdown action on node2 May 16 13:50:55 node1 crmd[1677]: notice: peer_update_callback: Stonith/shutdown of node2 not matched May 16 13:50:55 node1 crmd[1677]: notice: do_state_transition: State transition S_IDLE - S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ] May 16 13:50:57 node1 attrd[1675]: notice: attrd_local_callback: Sending full refresh (origin=crmd) May 16 13:50:57 node1 attrd[1675]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resource1 (1368710825) May 16 13:50:57 node1 attrd[1675]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true) May 16 13:50:58 node1 pengine[1676]: notice: unpack_config: On loss of CCM Quorum: Ignore May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now' May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now:
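When debugging this kind of timeout it is worth confirming that a fencing device is actually registered that claims to be able to handle the target, and that stonith-timeout is long enough to cover the full off/on cycle that -B (reboot) implies. For example:

    stonith_admin --list-registered     # devices stonith-ng knows about
    stonith_admin --list node2          # devices able to fence node2
    crm_attribute --type crm_config --name stonith-timeout --query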
Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
The cluster has 3 connections total. The first connection is the outside interface where services can communicate and is also used for cluster communication using mcast. The second interface is a cross-over that is solely for cluster communication. The third connection is another cross-over solely for DRBD replication. This issue happens when the first connection that is used for both the services and cluster communication is pulled on both nodes at the same time. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] pacemaker colocation after one node is down
On 2013-05-16 13:42, Wolfgang Routschka wrote: Hi Andreas, thank you for your answer. solutions is one coloation with -score ah, yes only _one_ of them with a non-negative value is needed. Scores of all constraints are added up. Regards, Andreas colocation cl_g_ip-address_not_on_r_postfix -1: g_ip-address r_postfix Greetings Wolfgang On 2013-05-15 21:30, Wolfgang Routschka wrote: Hi everybody, one question today about colocation rule on a 2-node cluster on scientific linux 6.4 and pacemaker/cman. 2-Node Cluster first node haproxy load balancer proxy service - second node with postfix service. colocation for running a group called g_ip-address (haproxy lsb-resouce and ipaddress resource) on the other node of the postfix server is cl_g_ip-address_not_on_r_postfix -inf: g_ip-address r_postfix -INF == never-ever ;-) The problem is now that the node with haproxy is down pacemaker cannot move/migrate the services to the other node -ok second colocation with lower score but it doesn?t works for me colocation cl_g_ip-address_on_r_postfix -1: g_ip-address r_postfix Whats my fault in these section? Hard to say without seeing the rest of your configuration, but you can run crm_simulate -s -L to see all the scores taken into account. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] crm subshell 1.2.4 incompatible to pacemaker 1.1.9?
The bug is in the function is_normal_node. This function checks the attribute type for state normal. But this attribute is not used any more. CIB output from Pacemaker 1.1.8 nodes node id=int2node1 uname=int2node1 instance_attributes id=nodes-int2node1 nvpair id=nodes-int2node1-standby name=standby value=off/ /node node id=int2node2 uname=int2node2 instance_attributes id=nodes-int2node2 nvpair id=nodes-int2node2-standby name=standby value=on/ /node /nodes CIB output from Pacemaker 1.1.7 nodes node id=int1node1 type=normal uname=int1node1 /node node id=int1node2 type=normal uname=int1node2 /node /nodes Therefore, function listnodes will not return any node and function standby will use the current node as node and the first argument as lifetime. In case of specified both (node and lifetime) it works because of other else path. Rainer Gesendet:Mittwoch, 15. Mai 2013 um 21:31 Uhr Von:Lars Ellenberg lars.ellenb...@linbit.com An:pacemaker@oss.clusterlabs.org Betreff:Re: [Pacemaker] crm subshell 1.2.4 incompatible to pacemaker 1.1.9? On Wed, May 15, 2013 at 03:34:14PM +0200, Dejan Muhamedagic wrote: On Tue, May 14, 2013 at 10:03:59PM +0200, Lars Ellenberg wrote: On Tue, May 14, 2013 at 09:59:50PM +0200, Lars Ellenberg wrote: On Mon, May 13, 2013 at 01:53:11PM +0200, Michael Schwartzkopff wrote: Hi, crm tells me it is version 1.2.4 pacemaker tell me it is verison 1.1.9 So it should work since incompatibilities are resolved in crm higher that version 1.2.1. Anywas crm tells me nonsense: # crm crm(live)# node crm(live)node# standby node1 ERROR: bad lifetime: node1 Your node is not named node1. check: crm node list Maybe a typo, maybe some case-is-significant nonsense, maybe you just forgot to use the fqdn. maybe the check for is this a known node name is (now) broken? standby with just one argument checks if that argument happens to be a known node name, and assumes that if it is not, it has to be a lifetime, and the current node is used as node name... Maybe we should invert that logic, and instead compare the single argument against allowed lifetime values (reboot, forever), and assume it is supposed to be a node name otherwise? Then the error would become ERROR: unknown node name: node1 Which is probably more useful most of the time. Dejan? Something like this maybe: diff --git a/modules/ui.py.in b/modules/ui.py.in --- a/modules/ui.py.in +++ b/modules/ui.py.in @@ -1185,7 +1185,7 @@ class NodeMgmt(UserInterface): if not args: node = vars.this_node if len(args) == 1: - if not args[0] in listnodes(): + if args[0] in (reboot, forever): Yes, I wanted to look at it again. Another complication is that the lifetime can be just about anything in that date ISO format. That may well be, but right now those would be rejected by crmsh anyways: if lifetime not in (None,reboot,forever): common_err(bad lifetime: %s % lifetime) return False -- : Lars Ellenberg : LINBIT Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] having problem with crm cib shadow
crm(live)cib# use gfs2
ERROR: gfs2: no such shadow CIB
crm(live)cib# new gfs2
A shadow instance 'gfs2' already exists. To prevent accidental destruction of the cluster, the --force flag is required in order to proceed.
crm(live)cib# list
crm(live)cib# use gfs2
ERROR: gfs2: no such shadow CIB
crm(live)cib#
___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
Andrew, I'd recommend adding more than one host to your p_ping resource and see if that improves the situation. When I had this problem, I observed better behavior after adding more than one IP to the list of hosts and changing the p_ping location constraint to be as follows: location loc_run_on_most_connected g_mygroup \ rule $id=loc_run_on_most_connected-rule -inf: not_defined p_ping or p_ping lte 0 More information: http://www.gossamer-threads.com/lists/linuxha/pacemaker/81502#81502 Hope this helps, Andrew - Original Message - From: Andrew Widdersheim awiddersh...@hotmail.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thursday, May 16, 2013 9:35:56 AM Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart? The cluster has 3 connections total. The first connection is the outside interface where services can communicate and is also used for cluster communication using mcast. The second interface is a cross-over that is solely for cluster communication. The third connection is another cross-over solely for DRBD replication. This issue happens when the first connection that is used for both the services and cluster communication is pulled on both nodes at the same time. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
Thanks for the help. Adding another node to the ping host_list may help in some situations, but the root issue doesn't really get solved. Also, the location constraint you posted is very different from mine. Your constraint requires connectivity, whereas the one I am trying to use looks for the best connectivity. I have used the location constraint you posted with success in the past, but I don't want my resource to be shut off in the event of a network outage that is happening across all nodes at the same time. Don't get me wrong; in some cluster configurations I do use the configuration you posted, but this setup is not one of them for specific reasons. ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
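For anyone comparing the two approaches, the difference is only in the location rule attached to the group: "prefer the best-connected node" versus "never run without connectivity" (attribute name pingd, as in the original configuration):

    # prefer whichever node currently reports the higher ping score,
    # but keep running somewhere even if every node loses the target
    location l_prefer_connected g_mysql \
        rule $id=l_prefer_connected-rule pingd: defined pingd

    # hard requirement: ban the group from nodes with no reachable target
    location l_require_connected g_mysql \
        rule $id=l_require_connected-rule -inf: not_defined pingd or pingd lte 0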
Re: [Pacemaker] having problem with crm cib shadow
Which Linux distribution and version of pacemaker are you using? /John On Thursday, 16 May 2013, George Gibat wrote: crm(live)cib# use gfs2 ERROR: gfs2: no such shadow CIB crm(live)cib# new gfs2 A shadow instance 'gfs2' already exists. To prevent accidental destruction of the cluster, the --force flag is required in order to proceed. crm(live)cib# list crm(live)cib# use gfs2 ERROR: gfs2: no such shadow CIB crm(live)cib# ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org javascript:; http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] having problem with crm cib shadow
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 centos 6.4, pacemaker 1.1.8-7.el6 On 2013-05-16 18:57, John McCabe wrote: Which Linux distribution and version of pacemaker are you using? /John On Thursday, 16 May 2013, George Gibat wrote: crm(live)cib# use gfs2 ERROR: gfs2: no such shadow CIB crm(live)cib# new gfs2 A shadow instance 'gfs2' already exists. To prevent accidental destruction of the cluster, the --force flag is required in order to proceed. crm(live)cib# list crm(live)cib# use gfs2 ERROR: gfs2: no such shadow CIB crm(live)cib# ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org javascript:; http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org - -- --- George Gibat, Technical Director, CCNP, MSCE, CISSP, CNE TTFN PGP public key - http://www.gibat.com/ggibat-pub.asc Gibat Enterprises, Inc Connecting you to the world (R) Your Portal to the Future (R) http://www.gibat.com http://www.spi.net 817.265.9962 9260 Walker Rd. Ovid, MI 48866 The information contained in and transmitted with this email is or may be confidential and/or privileged. It is intended only for the individual or entity designated. You are hereby notified that any dissemination, distribution, copying, use of or reliance upon the information contained in and transmitted with this email by or to anyone other than the intended recipient designated by the sender is unauthorized and strictly prohibited. If you have received this email in error, please contact the sender at (817)265-9962. Any email erroneously transmitted to you should be immediately deleted. -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.16 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlGVMfoACgkQaWdaxHduXchnAACcDnnu3cWSKjfp4aDg8y+65jvW GmQAnR4PP1AYntV0qGZ87q8o0BdTRHjD =eSll -END PGP SIGNATURE- ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[Pacemaker] question about interface failover
Greetings, I've set up a new 2-node mysql cluster using * drbd 8.3.1.3 * corosync 1.4.2 * pacemaker 1.1.7 on Debian Wheezy nodes. Failover seems to be working fine for everything except the IPs manually configured on the interfaces. See config here: http://pastebin.aquilenet.fr/?9eb51f6fb7d65fda#/YvSiYFocOzogAmPU9g+g09RcJvhHbgrY1JuN7D+gA4= If I bring down an interface, when the cluster restarts it, it only starts it with the VIP - the original IP and route have been removed. Not sure what to do to make sure the permanent IP and the routes get restored. I'm not all that versed in the cluster command line yet, and I'm using LCMC for most of my usage. Thanks for your help, -C ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] having problem with crm cib shadow
Worth trying crm_shadow as described here - http://www.gossamer-threads.com/lists/linuxha/pacemaker/84969 I had the same problem and took it as a sign that I should just move to pcs (from the RHEL repo, not the latest source), which went pretty smoothly, only had a few problems with assigning parameters to resources.. but that could easily be worked around using crm_resource. On 16 May 2013 20:23, George G. Gibat ggi...@gibat.com wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 centos 6.4, pacemaker 1.1.8-7.el6 On 2013-05-16 18:57, John McCabe wrote: Which Linux distribution and version of pacemaker are you using? /John On Thursday, 16 May 2013, George Gibat wrote: crm(live)cib# use gfs2 ERROR: gfs2: no such shadow CIB crm(live)cib# new gfs2 A shadow instance 'gfs2' already exists. To prevent accidental destruction of the cluster, the --force flag is required in order to proceed. crm(live)cib# list crm(live)cib# use gfs2 ERROR: gfs2: no such shadow CIB crm(live)cib# ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org javascript:; http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org - -- --- George Gibat, Technical Director, CCNP, MSCE, CISSP, CNE TTFN PGP public key - http://www.gibat.com/ggibat-pub.asc Gibat Enterprises, Inc Connecting you to the world (R) Your Portal to the Future (R) http://www.gibat.com http://www.spi.net 817.265.9962 9260 Walker Rd. Ovid, MI 48866 The information contained in and transmitted with this email is or may be confidential and/or privileged. It is intended only for the individual or entity designated. You are hereby notified that any dissemination, distribution, copying, use of or reliance upon the information contained in and transmitted with this email by or to anyone other than the intended recipient designated by the sender is unauthorized and strictly prohibited. If you have received this email in error, please contact the sender at (817)265-9962. Any email erroneously transmitted to you should be immediately deleted. -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.16 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlGVMfoACgkQaWdaxHduXchnAACcDnnu3cWSKjfp4aDg8y+65jvW GmQAnR4PP1AYntV0qGZ87q8o0BdTRHjD =eSll -END PGP SIGNATURE- ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
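A quick sketch of driving the shadow CIB with crm_shadow directly, which sidesteps crmsh's cib level entirely (the shadow name is only an example; --create drops you into a subshell with CIB_shadow set):

    crm_shadow --create gfs2      # create the shadow and enter a subshell using it
    # ... make changes with crm or cibadmin; they land in the shadow, not the live CIB ...
    crm_shadow --diff             # review what would change
    crm_shadow --commit gfs2      # push the shadow to the live cluster
    crm_shadow --delete gfs2      # or discard it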
[Pacemaker] pacemaker-remote tls handshaking
I've built pacemaker 1.1.10rc2 and am trying to get the pacemaker-remote features working on my Scientific Linux 6.4 system. It almost works... The /etc/pacemaker/authkey file is on all the cluster nodes, as well as my test VM (readable to all users, and checksums are the same everywhere). I can connect via telnet to port 3121 of the VM. I even see the ghost node appear for my VM when I use either 'crm status' or 'pcs status'. (Aside: crmsh doesn't know about the new meta attributes for remote...) But the communication isn't quite working. In my log I see: May 16 15:58:34 cvmh04 crmd[4893]: warning: lrmd_tcp_connect_cb: Client tls han dshake failed for server swbuildsl6:3121. Disconnecting May 16 15:58:34 swbuildsl6 pacemaker_remoted[2308]:error: lrmd_remote_client _msg: Remote lrmd tls handshake failed May 16 15:58:35 cvmh04 crmd[4893]: warning: lrmd_tcp_connect_cb: Client tls han dshake failed for server swbuildsl6:3121. Disconnecting May 16 15:58:35 swbuildsl6 pacemaker_remoted[2308]:error: lrmd_remote_client _msg: Remote lrmd tls handshake failed and it isn't long before pacemaker stops trying. Is there some additional configuration I need? /Lindsay ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
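For reference, the key provisioning that pacemaker_remoted expects is just a shared random blob readable by the daemon on every host involved, for example as in the Pacemaker Remote documentation:

    mkdir -p /etc/pacemaker
    dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1
    # copy the identical file to every cluster node and every remote guest,
    # then verify with: md5sum /etc/pacemaker/authkey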
[Pacemaker] mysql ocf resource agent - resource stays unmanaged if binary unavailable
Hi, our pacemaker setup provides mysql resource using ocf resource agent. Today I tested with my colleagues forcing mysql resource to fail. I don't understand the following behaviour. When I remove the mysqld_safe binary (which path is specified in crm config) from one server and moving the mysql resource to this server, the resource will not fail back and stays in the unmanaged status. We can see that the function check_binary(); is called within the mysql ocf resource agent and exists with error code 5. The fail-count gets raised to INFINITY and pacemaker tries to stop the resource fails. This results in a unmanaged status. How to reproduce: 1. mysql resource is running on node1 2. on node2 mv /usr/bin/mysqld_safe{,.bak} 3. crm resource move group-MySQL node2 4. observe corosync.log and crm_mon # cat /var/log/corosync/corosync.log [...] May 16 10:53:41 node2 lrmd: [1893]: info: operation start[119] on res-MySQL-IP1 for client 1896: pid 5137 exited with return code 0 May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_start_0 (call=119, rc=0, cib-update=98, confirmed=true) ok May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=94:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL-IP1_monitor_3 ) May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL-IP1 monitor[120] (pid 5222) May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=96:102:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_start_0 ) May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL start[121] (pid 5223) May 16 10:53:41 node2 lrmd: [1893]: info: RA output: (res-MySQL:start:stderr) 2013/05/16_10:53:41 ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe May 16 10:53:41 node2 lrmd: [1893]: info: operation start[121] on res-MySQL for client 1896: pid 5223 exited with return code 5 May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL_start_0 (call=121, rc=5, cib-update=99, confirmed=true) not installed May 16 10:53:41 node2 lrmd: [1893]: info: operation monitor[120] on res-MySQL-IP1 for client 1896: pid 5222 exited with return code 0 May 16 10:53:41 node2 crmd: [1896]: info: process_lrm_event: LRM operation res-MySQL-IP1_monitor_3 (call=120, rc=0, cib-update=100, confirmed=false) ok May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1 May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-res-MySQL (INFINITY) May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 44: fail-count-res-MySQL=INFINITY May 16 10:53:41 node2 attrd: [1894]: notice: attrd_ais_dispatch: Update relayed from node1 May 16 10:53:41 node2 attrd: [1894]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-res-MySQL (1368694421) May 16 10:53:41 node2 attrd: [1894]: notice: attrd_perform_update: Sent update 47: last-failure-res-MySQL=1368694421 May 16 10:53:41 node2 lrmd: [1893]: info: cancel_op: operation monitor[117] on res-DRBD-MySQL:1 for client 1896, its parameters: drbd_resource=[mysql] CRM_meta_role=[Master] CRM_meta_timeout=[2] CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_notify=[true] CRM_meta_clone_node_max=[1] CRM_meta_clone=[1] CRM_meta_clone_max=[2] CRM_meta_master_node_max=[1] CRM_meta_interval=[29000] CRM_meta_globally_unique=[false] CRM_meta_master_max=[1] cancelled May 16 10:53:41 node2 crmd: [1896]: info: send_direct_ack: ACK'ing resource op 
res-DRBD-MySQL:1_monitor_29000 from 3:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e: lrm_invoke-lrmd-1368694421-57 May 16 10:53:41 node2 crmd: [1896]: info: do_lrm_rsc_op: Performing key=8:104:0:28dea763-d2a2-4b9d-b86a-5357760ed16e op=res-MySQL_stop_0 ) May 16 10:53:41 node2 lrmd: [1893]: info: rsc:res-MySQL stop[122] (pid 5278) [...] I can not figure out why the fail-count gets raised to INFINITY and especially why pacemaker tries to stop the resource after failing. Shouldn't it be the best for the resource to fail back to another node instead of resulting in a unmanaged status on the node? is it possible to force this behavior in any way? Here some specs of the software used on our cluster nodes: node1:~# lsb_release -d dpkg -l pacemaker | awk '/ii/{print $2,$3}' uname -ri Description:Ubuntu 12.04.2 LTS pacemaker 1.1.6-2ubuntu3 3.2.0-41-generic x86_64 Best regards Vladimir ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
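As for recovering once the binary is back in place: the stop fails for the same reason as the start (the agent cannot validate its binary), and a failed stop is normally resolved by fencing; with fencing unavailable or disabled, the resource is left blocked/unmanaged instead. After restoring /usr/bin/mysqld_safe on node2, the failure has to be cleaned up before Pacemaker will manage the resource again, for example:

    crm resource cleanup res-MySQL
    # clears the fail-count and the failed stop, after which the resource
    # can be started and placed normally again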
Re: [Pacemaker] pacemaker-remote tls handshaking
- Original Message - From: Lindsay Todd rltodd@gmail.com To: The Pacemaker cluster resource manager Pacemaker@oss.clusterlabs.org Sent: Thursday, May 16, 2013 3:44:09 PM Subject: [Pacemaker] pacemaker-remote tls handshaking I've built pacemaker 1.1.10rc2 and am trying to get the pacemaker-remote features working on my Scientific Linux 6.4 system. It almost works... The /etc/pacemaker/authkey file is on all the cluster nodes, as well as my test VM (readable to all users, and checksums are the same everywhere). I can connect via telnet to port 3121 of the VM. I even see the ghost node appear for my VM when I use either 'crm status' or 'pcs status'. (Aside: crmsh doesn't know about the new meta attributes for remote...) But the communication isn't quite working. In my log I see: May 16 15:58:34 cvmh04 crmd[4893]: warning: lrmd_tcp_connect_cb: Client tls han dshake failed for server swbuildsl6:3121. Disconnecting May 16 15:58:34 swbuildsl6 pacemaker_remoted[2308]: error: lrmd_remote_client _msg: Remote lrmd tls handshake failed May 16 15:58:35 cvmh04 crmd[4893]: warning: lrmd_tcp_connect_cb: Client tls han dshake failed for server swbuildsl6:3121. Disconnecting May 16 15:58:35 swbuildsl6 pacemaker_remoted[2308]: error: lrmd_remote_client _msg: Remote lrmd tls handshake failed and it isn't long before pacemaker stops trying. Is there some additional configuration I need? Ah, you dared to try my new feature, and this is what you get! :D It looks like you have it covered. If you can telnet into the vm from the host (it should kick you off pretty quickly), then then all the firewall rules are correct. I'm not sure what is going on. The only thing I can think of is perhaps your gnutls version doesn't like that I'm using a non-blocking socket during the tls handshake. I doubt this will make a difference, but here's the key I use during testing, lrmd:ce9db0bc3cec583d3b3bf38b0ac9ff91 Has anyone else had success or ran into something similar yet? I'll help investigate this next week. I'll be out of the office until Tuesday. -- Vossel /Lindsay ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
Hi Andrew, Hi Vladislav, I try whether this correction is effective for this problem. * https://github.com/beekhof/pacemaker/commit/eb6264bf2db395779e65dadf1c626e050a388c59 Best Regards, Hideo Yamauchi. --- On Thu, 2013/5/16, Andrew Beekhof and...@beekhof.net wrote: On 16/05/2013, at 3:49 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 16.05.2013 02:46, Andrew Beekhof wrote: On 15/05/2013, at 6:44 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 15.05.2013 11:18, Andrew Beekhof wrote: On 15/05/2013, at 5:31 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 15.05.2013 10:25, Andrew Beekhof wrote: On 15/05/2013, at 3:50 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 15.05.2013 08:23, Andrew Beekhof wrote: On 15/05/2013, at 3:11 PM, renayama19661...@ybb.ne.jp wrote: Hi Andrew, Thank you for comments. The guest located it to the shared disk. What is on the shared disk? The whole OS or app-specific data (i.e. nothing pacemaker needs directly)? Shared disk has all the OS and the all data. Oh. I can imagine that being problematic. Pacemaker really isn't designed to function without disk access. You might be able to get away with it if you turn off saving PE files to disk though. I store CIB and PE files to tmpfs, and sync them to remote storage (CIFS) with lsyncd level 1 config (I may share it on request). It copies critical data like cib.xml, and moves everything else, symlinking it to original place. The same technique may apply here, but with local fs instead of cifs. Btw, the following patch is needed for that, otherwise pacemaker overwrites remote files instead of creating new ones on tmpfs: --- a/lib/common/xml.c 2011-02-11 11:42:37.0 +0100 +++ b/lib/common/xml.c 2011-02-24 15:07:48.541870829 +0100 @@ -529,6 +529,8 @@ write_file(const char *string, const char *filename) return -1; } + unlink(filename); Seems like it should be safe to include for normal operation. Exactly. Small flaw in that logic... write_file() is not used anywhere. Heh, thanks for spotting this. I recall write_file() was used for pengine, but some other function for CIB. You probably optimized that but forgot to remove unused function, that's why I was sure patch is still valid. And I did tests (CIFS storage outage simulation) only after initial patch, but not last years, that's why I didn't notice the regression - storage uses pacemaker too ;) . This should go to write_xml_file() (And probably to other places just before fopen(..., w), f.e. series). I've consolidated the code, however adding the unlink() would break things for anyone intentionally symlinking cib.xml from somewhere else (like a git repo). So I'm not so sure I should make the unlink() change :( Agree. I originally made it specific to pengine files. What do you prefer, simple wrapper in xml.c (f.e. unlink_and_write_xml_file()) or just add unlink() call to pengine before it calls write_xml_file()? The last one :) ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7
Just tried the patch you gave and it worked fine. Any plans on putting this patch in officially or was this a one off? Aside from this patch I guess the only thing to get things to work is to install things slightly differently and adding a symlink from cluster-glue's lrmd to pacemakers. Subject: Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7 From: and...@beekhof.net Date: Thu, 16 May 2013 15:20:59 +1000 CC: pacemaker@oss.clusterlabs.org To: awiddersh...@hotmail.com On 16/05/2013, at 3:16 PM, Andrew Widdersheim awiddersh...@hotmail.com wrote: I'll look into moving over to the cman option since that is preferred for RHEL6.4 now if I'm not mistaken. Correct I'll also try out the patch provided and see how that goes. So was LRMD not apart of pacemaker previously and later added? Was it originally apart of heartbeat/cluster-glue? I'm just trying to figure out all of the pieces so that I know how to fix if I choose to go down that road. Originally everything was part of heartbeat. Then what was then called the crm became pacemaker and the lrmd v1 became part of cluster-glue (because the theory was that someone might use it for a pacemaker alternative). That never happened and we stopped using almost everything else from cluster-glue, so when lrmd v2 was written, it was done so as part of pacemaker. or, tl;dr - yes and yes :) ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
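For anyone reproducing the workaround, the symlink mentioned above looks roughly like this; the exact paths depend on how cluster-glue and pacemaker were built and packaged, so treat them as placeholders:

    # let heartbeat find an lrmd at the path where cluster-glue used to install its own
    ln -s /usr/libexec/pacemaker/lrmd /usr/lib64/heartbeat/lrmd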
Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7
On 17/05/2013, at 11:38 AM, Andrew Widdersheim awiddersh...@hotmail.com wrote:

Just tried the patch you gave and it worked fine. Any plans on putting this patch in officially, or was this a one-off?

It will be in 1.1.10-rc3 soon.

Aside from this patch, I guess the only thing needed to get things to work is to install things slightly differently and add a symlink from cluster-glue's lrmd to pacemaker's.

Excellent.

Subject: Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7
From: and...@beekhof.net
Date: Thu, 16 May 2013 15:20:59 +1000
CC: pacemaker@oss.clusterlabs.org
To: awiddersh...@hotmail.com

On 16/05/2013, at 3:16 PM, Andrew Widdersheim awiddersh...@hotmail.com wrote:

I'll look into moving over to the cman option since that is preferred for RHEL 6.4 now, if I'm not mistaken.

Correct.

I'll also try out the patch provided and see how that goes. So was the lrmd not part of pacemaker previously and added later? Was it originally part of heartbeat/cluster-glue? I'm just trying to figure out all of the pieces so that I know how to fix things if I choose to go down that road.

Originally everything was part of heartbeat. Then what was then called the CRM became pacemaker, and lrmd v1 became part of cluster-glue (because the theory was that someone might use it for a pacemaker alternative). That never happened, and we stopped using almost everything else from cluster-glue, so when lrmd v2 was written, it was done as part of pacemaker.

or, tl;dr - yes and yes :)

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] pacemaker-1.1.10 results in Failed to sign on to the LRM 7
I'm attaching 3 patches I made fairly quickly to fix the installation issues, and also an issue I noticed with the ping OCF agent from the latest pacemaker.

One is for cluster-glue, to prevent lrmd from being built and later installed. You may also want to modify this patch to take lrmd out of both spec files included in the source download if you plan to build an RPM. I'm not sure what I did here is the best way to approach this problem, so if anyone has anything better please let me know.

One is for pacemaker, to create the lrmd symlink when building with heartbeat support. Note that the spec does not need anything changed here.

Finally, I saw the following error in the logs with the latest ping OCF agent, and the attached patch seems to fix the issue:

May 16 01:10:13 node2 lrmd[16133]: notice: operation_finished: p_ping_monitor_5000:17758 [ /usr/lib/ocf/resource.d/pacemaker/ping: line 296: [: : integer expression expected ]

cluster-glue-no-lrmd.patch Description: Binary data
pacemaker-lrmd-hb.patch Description: Binary data
pacemaker-ping-failure.patch Description: Binary data

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
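The attached ping patch itself is not preserved in the archive, but "[: : integer expression expected" is the usual symptom of an empty variable reaching a numeric test in the shell. A defensive fix for that class of bug generally looks like the sketch below; the variable name is hypothetical and not taken from the actual resource agent:

    # A numeric test like [ "$count" -lt 1 ] fails with
    # "[: : integer expression expected" when $count is empty.
    # Defaulting the variable before the test avoids the error
    # (variable name is hypothetical):
    count="${count:-0}"
    if [ "$count" -lt 1 ]; then
        echo "no ping targets reachable"
    fi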
Re: [Pacemaker] [Question and Problem] In vSphere5.1 environment, IO blocking of pengine occurs at the time of shared disk trouble for a long time.
On 17/05/2013, at 10:27 AM, renayama19661...@ybb.ne.jp wrote:

Hi Andrew, Hi Vladislav,

I will try whether this fix is effective for this problem.
* https://github.com/beekhof/pacemaker/commit/eb6264bf2db395779e65dadf1c626e050a388c59

Doubtful, it just reduces code duplication. But it would also be a single place to put a deployment-specific patch :)

Best Regards,
Hideo Yamauchi.

--- On Thu, 2013/5/16, Andrew Beekhof and...@beekhof.net wrote:

On 16/05/2013, at 3:49 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
16.05.2013 02:46, Andrew Beekhof wrote:
On 15/05/2013, at 6:44 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
15.05.2013 11:18, Andrew Beekhof wrote:
On 15/05/2013, at 5:31 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
15.05.2013 10:25, Andrew Beekhof wrote:
On 15/05/2013, at 3:50 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
15.05.2013 08:23, Andrew Beekhof wrote:
On 15/05/2013, at 3:11 PM, renayama19661...@ybb.ne.jp wrote:

Hi Andrew, Thank you for the comments. The guest is located on the shared disk.

What is on the shared disk? The whole OS, or app-specific data (i.e. nothing pacemaker needs directly)?

The shared disk holds the whole OS and all the data.

Oh. I can imagine that being problematic. Pacemaker really isn't designed to function without disk access. You might be able to get away with it if you turn off saving PE files to disk, though.

I store the CIB and PE files on tmpfs and sync them to remote storage (CIFS) with an lsyncd level-1 config (I can share it on request). It copies critical data like cib.xml, and moves everything else, symlinking it back to the original place. The same technique may apply here, but with a local fs instead of CIFS. Btw, the following patch is needed for that, otherwise pacemaker overwrites the remote files instead of creating new ones on tmpfs:

--- a/lib/common/xml.c  2011-02-11 11:42:37.0 +0100
+++ b/lib/common/xml.c  2011-02-24 15:07:48.541870829 +0100
@@ -529,6 +529,8 @@ write_file(const char *string, const char *filename)
         return -1;
     }
+    unlink(filename);

Seems like it should be safe to include for normal operation.

Exactly.

Small flaw in that logic... write_file() is not used anywhere.

Heh, thanks for spotting this. I recall write_file() was used for pengine, but some other function for the CIB. You probably refactored that but forgot to remove the unused function, which is why I was sure the patch was still valid. And I only ran tests (simulating a CIFS storage outage) right after the initial patch, not in recent years, which is why I didn't notice the regression - the storage uses pacemaker too ;). This should go into write_xml_file() (and probably other places just before fopen(..., w), e.g. the series files).

I've consolidated the code; however, adding the unlink() would break things for anyone intentionally symlinking cib.xml from somewhere else (like a git repo). So I'm not so sure I should make the unlink() change :(

Agree. I originally made it specific to pengine files. What do you prefer: a simple wrapper in xml.c (e.g. unlink_and_write_xml_file()) or just adding an unlink() call to pengine before it calls write_xml_file()?
The last one :)

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
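As background to the setup Vladislav describes in this thread (keeping the CIB and PE files on tmpfs so the daemons never block on a slow or failed disk), the mount side of it might look like the sketch below. The directory paths are assumptions - they vary with pacemaker version and packaging - and the lsyncd configuration that copies the files back out to persistent storage is not shown:

    # Hypothetical paths; confirm the real state directories used by your
    # pacemaker build before doing this, and remember the contents are
    # lost on reboot unless something (e.g. lsyncd) syncs them elsewhere.
    mount -t tmpfs -o size=64m tmpfs /var/lib/pengine
    mount -t tmpfs -o size=16m tmpfs /var/lib/heartbeat/crm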