Re: [Pacemaker] Routing-Ressources on a 2-Node-Cluster
Hi Devin,

thank you very much for your answer.

> If you insist on trying to do this with just the Linux-HA cluster, I
> don't have any suggestions as to how you should proceed.

I know that the construct we are building is quite complicated. The problem is that the active network (10.20.10.x) is too small to cover both locations (in reality we only have a /26 subnet available), and we cannot change/move these (public) addresses to another range. In addition, NAT translation through a router is not possible: the servers have to be reachable directly via their public IP address, which has to be the cluster IP. So we have to deal with two different networks in the two locations (10.20.11.x/10.20.12.x) and create an overlay for the current IP addresses :-(

The current status is that I modified the heartbeat Route2 script so that it also accepts the metric as a parameter, and with this it works as expected. But I am really fighting with the new corosync/pacemaker; for me the old heartbeat (1) was much easier and gave me all the functionality I needed. So I am making progress, but quite/too slowly ...

--
To answer, please replace "invalid" with "de"!

Bye and see you later,
Thorolf

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
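For readers with the same need: the stock heartbeat Route agent apparently does not expose a route metric, which is why Thorolf patched it. A purely illustrative crm shell sketch of what such a resource definition might look like; the agent name "Route2" and the "metric" parameter come from his description of a local modification (not a published agent), and the addresses are examples invented around this thread:

```
# Hypothetical resource definition (crm shell syntax). "Route2" is the
# poster's locally modified copy of the Route agent; "metric" is the
# parameter he added. All values are examples only.
primitive overlay-route ocf:heartbeat:Route2 \
    params destination="10.20.10.0/26" device="eth0" \
           gateway="10.20.11.1" metric="50" \
    op monitor interval="10s" timeout="20s"
```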
Re: [Pacemaker] pcs equivalent of crm configure erase
On 2013-04-21T09:57:02, Andrew Beekhof <and...@beekhof.net> wrote:

> Because the author wanted it removed from Pacemaker, at which point
> someone would have needed to navigate the red tape to get it back into
> the distribution.

It is easier to include a new package in Fedora than to manage a package split? I had no idea.

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] iscsi target mounting readonly on client
Hi Felix,

Sorry for the late answer. I gave your suggestions a try; it worked immediately with block I/O and the increased block timeout. I'll stick with iet for now, as my supervisor wants to stick to standard packages. You really saved my day.

Best regards,

Joseph-André GUARAGNA

2013/4/14 Joseph-Andre Guaragna joseph-an...@rdmo.com:
> Thanks for the information. I'll give it a try on Monday and give you
> feedback.

2013/4/12 Felix Zachlod fz.li...@sis-gmbh.info:
> Hello Joseph!
>
> -----Original Message-----
> From: Joseph-Andre Guaragna [mailto:joseph-an...@rdmo.com]
> Sent: Friday, 12 April 2013 17:19
> To: pacemaker@oss.clusterlabs.org
> Subject: [Pacemaker] iscsi target mounting readonly on client
>
> You have to make absolutely sure of two things:
>
> 1. Data that has been acknowledged by your iSCSI target to your
> initiator has hit the device, and not only the page cache! If you run
> your target in fileio mode you have to use write-through, because with
> write-back neither you nor your cluster manager can ever tell whether
> the writes have completed before switching the DRBD states. That will
> only perform well if you have a decent RAID card with BBWC! BUT YOU
> MUST RUN WRITE-THROUGH, or blockio (which is write-through too).
> Running write-back in such a constellation IS NOT SAFE; you risk
> SERIOUS DATA CORRUPTION when switching targets.
>
> 2. On your initiator side, try to raise the
> /sys/block/sd*/device/timeout value. That is the time the block device
> will wait for a command to complete before handing an I/O error to the
> upper layer, which will most probably lead to your filesystem
> remounting read-only.
>
> 3. This is just a side note: do not use iet. We ran a production
> target with iet for about 2 years, and it caused horrible problems for
> us. Consider scst or lio (I personally don't have any experience with
> lio, but scst has been running in our production environment for years
> now without any problems).
>
> regards
> Felix
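Felix's second point (raising the SCSI command timeout on the initiator) can be scripted. A minimal sketch, assuming the usual sysfs layout /sys/block/sd*/device/timeout and using 180 seconds purely as an example value; tune it to your failover window. The demo below runs against a throwaway directory tree instead of the real /sys, so it is safe to try anywhere:

```shell
set -eu

# Write a new SCSI command timeout (in seconds) into every matching
# sysfs file under the given root (normally /sys).
raise_scsi_timeout() {
    root=$1; secs=$2
    for f in "$root"/block/sd*/device/timeout; do
        [ -e "$f" ] || continue   # glob matched nothing
        echo "$secs" > "$f"
    done
}

# Demo against a fake sysfs tree instead of the real /sys:
demo=$(mktemp -d)
mkdir -p "$demo/block/sda/device"
echo 30 > "$demo/block/sda/device/timeout"   # 30s is the usual kernel default
raise_scsi_timeout "$demo" 180
cat "$demo/block/sda/device/timeout"         # prints 180
rm -rf "$demo"
```

Note that a value written to sysfs does not survive a reboot or a device re-add; for production, a udev rule is the usual persistent mechanism.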
Re: [Pacemaker] Enabling debugging with cman, corosync, pacemaker at runtime
On 23/04/2013, at 2:51 PM, Andreas Mock andreas.m...@web.de wrote:
> Hi Andrew,
> thank you for that hint.

No problem. The basic idea is to be quiet by default but allow access to extreme levels of verbosity should the need arise :)

> Best regards
> Andreas Mock
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, 23 April 2013 01:45
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Enabling debugging with cman, corosync, pacemaker at runtime
>
> Depending on your version of pacemaker you can do...
>
> # Enable trace logging (if it isn't already)
> killall -USR1 process_name
>
> # Dump trace logging to disk
> killall -TRAP process_name
>
> # Find out what file it was dumped to
> grep blackbox /var/log/messages
>
> # Read it
> qb-blackbox /path/to/file
>
> Subsequent calls to killall -TRAP ... will contain only logs since the last dump.
>
> On 23/04/2013, at 2:41 AM, Andreas Mock andreas.m...@web.de wrote:
>> Hi all,
>> is there a way to enable debug output on a cman, corosync, pacemaker
>> stack without restarting the whole cman stack? I've found
>> <logging debug="on"/> for cluster.conf, assuming that this determines
>> the value of the config-db keys cluster.logging.debug=on /
>> logging.debug=on. Is it enough to write new values to these keys? Or
>> do I have to notify one or several processes to react to this change?
>>
>> Best regards
>> Andreas Mock
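For reference, the cluster.conf element being discussed is a one-liner. A sketch only: debug="on" comes from Andreas's mail, while the to_syslog/to_logfile/logfile attributes shown alongside it are the usual cman logging options and are purely illustrative here:

```xml
<!-- cluster.conf fragment; debug="on" corresponds to the
     cluster.logging.debug key in the corosync object database -->
<logging debug="on" to_syslog="yes" to_logfile="yes"
         logfile="/var/log/cluster/corosync.log"/>
```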
Re: [Pacemaker] fail over just failed
> > crit: get_timet_now: Defaulting to 'now'
> >
> > What does this message mean?
>
> It means some idiot (me) forgot to downgrade a debug message before
> committing. You can safely ignore it.

Critical important debug message ignored. Thanks for owning up to it.

> Were there any other issues resulting from the connectivity loss?

Not as far as I can tell. I didn't determine which machines didn't have a network connection at the time, so anything I can infer is just a little too speculative. I can't think of a scenario where a DC-less cluster can keep resources like IPs up without relying on the default router(s) to make the right choice.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
[Pacemaker] why so long to stonith?
Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed (-KILL) corosync on a peer node. Pacemaker seemed to take a long time to transition to STONITHing it after noticing it was AWOL, though:

Apr 23 19:05:20 node2 corosync[1324]: [TOTEM ] A processor failed, forming new configuration.
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 188: memb=1, new=0, lost=1
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: lost: node1 4252674240
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 188: memb=1, new=0, lost=0
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: update_member: Node 4252674240/node1 is now: lost
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: send_member_notification: Sending membership update 188 to 2 children
Apr 23 19:05:21 node2 corosync[1324]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 23 19:05:21 node2 corosync[1324]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.155) ; members(old:2 left:1)
Apr 23 19:05:21 node2 corosync[1324]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 23 19:05:21 node2 crmd[1634]: notice: ais_dispatch_message: Membership 188: quorum lost
Apr 23 19:05:21 node2 crmd[1634]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:05:21 node2 cib[1629]: notice: ais_dispatch_message: Membership 188: quorum lost
Apr 23 19:05:21 node2 cib[1629]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:08:31 node2 crmd[1634]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Apr 23 19:08:31 node2 pengine[1633]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 23 19:08:31 node2 pengine[1633]: warning: pe_fence_node: Node node1 will be fenced because the node is no longer part of the cluster
Apr 23 19:08:31 node2 pengine[1633]: warning: determine_online_status: Node node1 is unclean
Apr 23 19:08:31 node2 pengine[1633]: warning: custom_action: Action MGS_e4a31b_stop_0 on node1 is unrunnable (offline)
Apr 23 19:08:31 node2 pengine[1633]: warning: stage6: Scheduling Node node1 for STONITH
Apr 23 19:08:31 node2 pengine[1633]: notice: LogActions: Move MGS_e4a31b#011(Started node1 - node2)
Apr 23 19:08:31 node2 crmd[1634]: notice: te_fence_node: Executing reboot fencing operation (15) on node1 (timeout=6)
Apr 23 19:08:31 node2 stonith-ng[1630]: notice: handle_request: Client crmd.1634.642b9c6e wants to fence (reboot) 'node1' with device '(any)'
Apr 23 19:08:31 node2 stonith-ng[1630]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node1: fb431eb4-789c-41bc-903e-4041d50e93b4 (0)
Apr 23 19:08:31 node2 pengine[1633]: warning: process_pe_message: Calculated Transition 115: /var/lib/pacemaker/pengine/pe-warn-7.bz2
Apr 23 19:08:41 node2 stonith-ng[1630]: notice: log_operation: Operation 'reboot' [27682] (call 0 from crmd.1634) for host 'node1' with device 'st-node1' returned: 0 (OK)
Apr 23 19:08:41 node2 stonith-ng[1630]: notice: remote_op_done: Operation reboot of node1 by node2 for crmd.1634@node2.fb431eb4: OK
Apr 23 19:08:41 node2 crmd[1634]: notice: tengine_stonith_callback: Stonith operation 3/15:115:0:c118573c-84e3-48bd-8dc9-40de24438385: OK (0)
Apr 23 19:08:41 node2 crmd[1634]: notice: tengine_stonith_notify: Peer node1 was terminated (st_notify_fence) by node2 for node2: OK (ref=fb431eb4-789c-41bc-903e-4041d50e93b4) by client crmd.1634
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 192: memb=1, new=0, lost=0
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 192: memb=2, new=1, lost=0
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: update_member: Node 4252674240/node1 is now: member
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: NEW: node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: send_member_notification: Sending membership update 192 to 2 children
Apr 23
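For what it's worth, the gap is easy to quantify from the excerpt: quorum was lost at 19:05:21 but the policy engine only scheduled fencing at 19:08:31. A quick sketch of extracting that delay from the logs (the grep patterns match the two log lines above; adjust them for your own syslog format):

```shell
# Measure the delay between "quorum lost" and the STONITH decision in a
# syslog excerpt; the timestamp is field 3, formatted HH:MM:SS.
logs='Apr 23 19:05:21 node2 crmd[1634]: notice: ais_dispatch_message: Membership 188: quorum lost
Apr 23 19:08:31 node2 pengine[1633]: warning: stage6: Scheduling Node node1 for STONITH'

echo "$logs" | awk '
/quorum lost/ && !a { split($3, t, ":"); a = t[1]*3600 + t[2]*60 + t[3] }
/STONITH/     && !b { split($3, t, ":"); b = t[1]*3600 + t[2]*60 + t[3] }
END { print (b - a) " seconds" }'
# prints: 190 seconds
```

So the cluster sat on the membership loss for a little over three minutes before even computing a transition, which matches the C_TIMER_POPPED trigger in the do_state_transition line.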
[Pacemaker] Pacemaker core dumps
Hello everyone,

Below is a message that my colleague tried to post to the mailing list, but somehow it didn't make it yet, so I'm re-posting it just in case it got lost somewhere:

Hi All,

I currently have a pacemaker + drbd cluster. I am getting core dumps every 15 minutes and can't seem to figure out what has triggered this. How do you see what is being output in the core dump?

Also, in my corosync.log I found the following errors:

Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: Entity: line 1: parser error : invalid character in attribute value
Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: e-9c6d-44bf-963b-87836613e694 lrmd_rsc_output=tomcat6 (pid 8899) is running...
Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: ^
Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: Entity: line 1: parser error : attributes construct error
Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: e-9c6d-44bf-963b-87836613e694 lrmd_rsc_output=tomcat6 (pid 8899) is running...
Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: ^
Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: Entity: line 1: parser error : Couldn't find end of Start Tag lrmd_notify line 1
Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: e-9c6d-44bf-963b-87836613e694 lrmd_rsc_output=tomcat6 (pid 8899) is running...
Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: ^
Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: Entity: line 1: parser error : Extra content at the end of the document
Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: e-9c6d-44bf-963b-87836613e694 lrmd_rsc_output=tomcat6 (pid 8899) is running...
Apr 22 10:37:15 [2058] crmd: error: crm_xml_err: XML Error: ^
Apr 22 10:37:15 [2058] crmd: warning: string2xml: Parsing failed (domain=1, level=3, code=5): Extra content at the end of the document
Apr 22 10:37:15 [2058] crmd: warning: string2xml: String start: lrmd_notify lrmd_origin=send_cmd_complete_notify
Apr 22 10:37:15 [2058] crmd: warning: string2xml: String start+645: =3 CRM_meta_interval=15000//lrmd_notify
Apr 22 10:37:15 [2058] crmd: error: crm_abort: string2xml: Forked child 22964 to record non-fatal assert at xml.c:605 : String parsing error

So I thought the errors might be in my XML config file, so I extracted the file and found no errors in there. From the log file, the string e-9c6d-44bf-963b-87836613e694 seems to be part of the transition-key. I am not sure whether that helps shed some light on the issue, but any help would be appreciated.

Xavier Lashmar
Analyste de Systèmes | Systems Analyst
Service étudiants, service de l'informatique et des communications / Student services, computing and communications services
1 Nicholas Street (810)
Ottawa ON K1N 7B7
Tél. | Tel. 613-562-5800 (2120)
Re: [Pacemaker] why so long to stonith?
As I understand it, this is a known issue with the 1.1.8 release. I believe 1.1.9 is now available from the pacemaker repos, and it should fix the problem.

digimer

On 04/23/2013 03:34 PM, Brian J. Murrell wrote:
> Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed
> (-KILL) corosync on a peer node. Pacemaker seemed to take a long time
> to transition to STONITHing it after noticing it was AWOL, though:
>
> Apr 23 19:05:20 node2 corosync[1324]: [TOTEM ] A processor failed, forming new configuration.
> Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 188: memb=1, new=0, lost=1
> Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
> Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: lost: node1 4252674240
> Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 188: memb=1, new=0, lost=0
> Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
> Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
> Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: update_member: Node 4252674240/node1 is now: lost
> Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: send_member_notification: Sending membership update 188 to 2 children
> Apr 23 19:05:21 node2 corosync[1324]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Apr 23 19:05:21 node2 corosync[1324]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.155) ; members(old:2 left:1)
> Apr 23 19:05:21 node2 corosync[1324]: [MAIN ] Completed service synchronization, ready to provide service.
> Apr 23 19:05:21 node2 crmd[1634]: notice: ais_dispatch_message: Membership 188: quorum lost
> Apr 23 19:05:21 node2 crmd[1634]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
> Apr 23 19:05:21 node2 cib[1629]: notice: ais_dispatch_message: Membership 188: quorum lost
> Apr 23 19:05:21 node2 cib[1629]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
> Apr 23 19:08:31 node2 crmd[1634]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Apr 23 19:08:31 node2 pengine[1633]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Apr 23 19:08:31 node2 pengine[1633]: warning: pe_fence_node: Node node1 will be fenced because the node is no longer part of the cluster
> Apr 23 19:08:31 node2 pengine[1633]: warning: determine_online_status: Node node1 is unclean
> Apr 23 19:08:31 node2 pengine[1633]: warning: custom_action: Action MGS_e4a31b_stop_0 on node1 is unrunnable (offline)
> Apr 23 19:08:31 node2 pengine[1633]: warning: stage6: Scheduling Node node1 for STONITH
> Apr 23 19:08:31 node2 pengine[1633]: notice: LogActions: Move MGS_e4a31b#011(Started node1 - node2)
> Apr 23 19:08:31 node2 crmd[1634]: notice: te_fence_node: Executing reboot fencing operation (15) on node1 (timeout=6)
> Apr 23 19:08:31 node2 stonith-ng[1630]: notice: handle_request: Client crmd.1634.642b9c6e wants to fence (reboot) 'node1' with device '(any)'
> Apr 23 19:08:31 node2 stonith-ng[1630]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node1: fb431eb4-789c-41bc-903e-4041d50e93b4 (0)
> Apr 23 19:08:31 node2 pengine[1633]: warning: process_pe_message: Calculated Transition 115: /var/lib/pacemaker/pengine/pe-warn-7.bz2
> Apr 23 19:08:41 node2 stonith-ng[1630]: notice: log_operation: Operation 'reboot' [27682] (call 0 from crmd.1634) for host 'node1' with device 'st-node1' returned: 0 (OK)
> Apr 23 19:08:41 node2 stonith-ng[1630]: notice: remote_op_done: Operation reboot of node1 by node2 for crmd.1634@node2.fb431eb4: OK
> Apr 23 19:08:41 node2 crmd[1634]: notice: tengine_stonith_callback: Stonith operation 3/15:115:0:c118573c-84e3-48bd-8dc9-40de24438385: OK (0)
> Apr 23 19:08:41 node2 crmd[1634]: notice: tengine_stonith_notify: Peer node1 was terminated (st_notify_fence) by node2 for node2: OK (ref=fb431eb4-789c-41bc-903e-4041d50e93b4) by client crmd.1634
> Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 192: memb=1, new=0, lost=0
> Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
> Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 192: memb=2, new=1, lost=0
> Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: update_member: Node 4252674240/node1 is now: member
> Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: NEW: node1 4252674240
> Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
> Apr 23