Re: [Pacemaker] FW: time pressure - software raid cluster, raid1 resource agent, help needed
On Sun, 2011-03-06 at 12:40 +0100, patrik.rappo...@knapp.com wrote:

Hi,

I assume the basic problem is in your RAID configuration. If you unmap one box, the devices should not be in status FAIL but degraded. So what is the exit status of "mdadm --detail --test /dev/md0" after unmapping?

Furthermore, I would start with one isolated group containing the RAID, the LVM, and the filesystem, to keep it simple.

Regards,
Holger


Hi,

does anyone have an idea about this? I only have the servers until Friday next week, so to my regret I am under time pressure :( As I already wrote, I would appreciate and test any ideas you have. Also, if someone has already built clusters with lvm-mirror, I would be happy to get a CIB or some configuration examples.

Thank you very much in advance.

kr
Patrik


From: patrik.rappo...@knapp.com, 03.03.2011 15:11
Please reply to: The Pacemaker cluster resource manager
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] software raid cluster, raid1 resource agent, help needed

Good Day,

I have a 2-node active/passive cluster which is connected to two IBM 4700 storage systems. I configured three RAID arrays and I use the Raid1 resource agent to manage the RAID1 devices in the cluster. When I disable the mapping of one storage to simulate the failure of that storage, the Raid1 resources change to state FAILED, and the second node then takes over the resources and is able to start the RAID devices.

So I am confused why the active node cannot keep the Raid1 resources, while the former passive node takes them over and can start them correctly. I would really appreciate your advice, or perhaps someone already has an example configuration for Raid1 with two storages.

Thank you very much in advance. Attached you can find my cib.xml.

kr
Patrik

Mit freundlichen Grüßen / Best Regards

Patrik Rapposch, BSc
System Administration

KNAPP Systemintegration GmbH
Waltenbachstraße 9
8700 Leoben, Austria
Phone: +43 3842 805-915
Fax: +43 3842 805-500
patrik.rappo...@knapp.com
www.KNAPP.com

Commercial register number: FN 138870x
Commercial register court: Leoben
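For reference, mdadm(8) documents that with --test the exit status of "mdadm --detail" reflects the array state (roughly: 0 clean, 1 degraded with at least one failed device, 2 unusable with multiple failed devices, 4 error), so a mirror that lost one storage box should normally exit with 1 rather than be reported dead. Below is a minimal crm sketch of the isolated group Holger suggests; the RAID device, volume group, mount point and filesystem type are placeholders, not values taken from the attached cib.xml:

primitive p_raid1 ocf:heartbeat:Raid1 \
        params raidconf="/etc/mdadm.conf" raiddev="/dev/md0" \
        op monitor interval="30s"
primitive p_lvm ocf:heartbeat:LVM \
        params volgrpname="vg_test" \
        op monitor interval="30s"
primitive p_fs ocf:heartbeat:Filesystem \
        params device="/dev/vg_test/lv_test" directory="/mnt/test" fstype="ext3" \
        op monitor interval="30s"
group g_storage p_raid1 p_lvm p_fs

With a group like this the whole chain starts and stops as one unit on a single node, which makes it easier to see whether it is the Raid1 monitor itself that reports failure when one storage box is unmapped.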
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] Failure after intermittent network outage
Hi everyone.

We have a three-node cluster running Pacemaker 1.0.10 on RHEL 5.5. Two nodes (wapgw1-1 and wapgw1-2) are configured for DRBD and run virtual machines on top of it. The third node (wapgw1-log) is mostly a quorum server, i.e. it has neither libvirtd nor DRBD installed. There are location constraints which allow resources to run on the real nodes only. All three nodes are connected to the network over bonded links in active-backup mode. STONITH had been configured but was unavailable at the moment; that's bad, I know.

The problem came when one of the two interfaces on the quorum node (wapgw1-log) went down. It was not the first time, and previously this did not cause any harm. This time corosync lost connectivity and the cluster split into partitions:

Mar 1 11:15:58 wapgw1-log corosync[24536]: [TOTEM ] A processor failed, forming new configuration.
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 3500: memb=1, new=0, lost=2
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: memb: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: lost: wapgw1-1 1098666762
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: lost: wapgw1-2 1115443978
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 3500: memb=1, new=0, lost=0
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: ais_mark_unseen_peer_dead: Node wapgw1-2 was not seen in the previous transition
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: update_member: Node 1115443978/wapgw1-2 is now: lost
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: ais_mark_unseen_peer_dead: Node wapgw1-1 was not seen in the previous transition
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: update_member: Node 1098666762/wapgw1-1 is now: lost
Mar 1 11:15:59 wapgw1-log corosync[24536]: [pcmk ] info: send_member_notification: Sending membership update 3500 to 2 children
Mar 1 11:15:59 wapgw1-log crmd: [24547]: notice: ais_dispatch: Membership 3500: quorum lost
Mar 1 11:15:59 wapgw1-log corosync[24536]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66) votes=1 born=3400 seen=3496 proc=00013312
Mar 1 11:15:59 wapgw1-log cib: [24543]: notice: ais_dispatch: Membership 3500: quorum lost
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65) votes=1 born=3404 seen=3496 proc=00013312
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66) votes=1 born=3400 seen=3496 proc=00013312
Mar 1 11:15:59 wapgw1-log crmd: [24547]: WARN: check_dead_member: Our DC node (wapgw1-2) left the cluster
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65) votes=1 born=3404 seen=3496 proc=00013312
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: update_dc: Unset DC wapgw1-2
Mar 1 11:15:59 wapgw1-log corosync[24536]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_te_control: Registering TE UUID: 1be865f6-557d-45c4-b549-c10dbab5acc4
Mar 1 11:15:59 wapgw1-log crmd: [24547]: WARN: cib_client_add_notify_callback: Callback already present
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: set_graph_functions: Setting custom graph functions
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
Mar 1 11:15:59 wapgw1-log crmd: [24547]: info: do_dc_takeover: Taking over DC status for this partition
Mar 1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_readwrite: We are now in R/W mode

DC node has noticed member loss:

Mar 1 11:15:59 wapgw1-2 pengine: [5748]: WARN: pe_fence_node: Node wapgw1-log will be fenced because it is un-expectedly down
Mar 1 11:15:59 wapgw1-2 pengine: [5748]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=member,
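Not part of the original thread, but as an illustration of the failure mode described above (a single interface outage on one node partitioning the cluster): corosync can be given a second, independent totem ring so that membership survives the loss of one network path. A rough corosync.conf sketch; the addresses and ports are placeholders, not the poster's actual settings:

totem {
        version: 2
        rrp_mode: passive            # keep membership alive if one ring fails
        interface {
                ringnumber: 0
                bindnetaddr: 10.83.124.0       # placeholder: primary cluster network
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.100.0     # placeholder: second, independent network
                mcastaddr: 226.94.1.2
                mcastport: 5407
        }
}

Whether a second ring would have helped here depends on whether the two paths share a failure domain; with working STONITH the resulting partition would at least have been resolved by fencing rather than left split.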
[Pacemaker] Help with batch import and resource distribution
Hi all,

I'm creating my Pacemaker configuration from a script executed by Chef and I'm having some issues. I have init scripts that run the following services:

haproxy
nginx
ec2setip
chef-client

I would like the following distribution:

A single node: haproxy and ec2setip
All other nodes: nginx
All nodes: chef-client

Essentially, I use HAProxy for load balancing, and nginx for SSL decryption and serving static pages, so I want nginx to run on every node that isn't the HAProxy node. During a failover, I want haproxy to be started and ec2setip to be run on a single node, and all other nodes to start nginx. I'm not using STONITH on purpose: if one node takes over the IP while another is still running, it does not affect my service, since none of my clustered services perform any data writes.

I'm using the following configuration, and importing it with this command:

crm configure /tmp/proxyfailover.txt

Here is the content of /tmp/proxyfailover.txt:

BEGIN FILE --
property stonith-enabled=false
primitive haproxy lsb:/etc/init.d/haproxy
primitive nginx lsb:/etc/init.d/nginx
primitive ec2setip lsb:/etc/init.d/ec2setip
primitive chefclient lsb:/etc/init.d/chef
order nginx-after-haproxy inf: haproxy nginx
order ec2setip-after-nginx inf: nginx ec2setip
order chefclient-after-ec2setip inf: ec2setip chefclient
commit
END FILE --

I've read this section, but I'm a bit lost. Any help would be greatly appreciated.
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-resource-sets-collocation.html#id580996

Thanks,
Todd

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
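Not from the thread itself, but one way the distribution Todd describes could be expressed, sketched in crm shell. It keeps his resource names, assumes the usual lsb:<script-name> form for init scripts living in /etc/init.d, and uses a simple -inf colocation instead of the resource sets from the linked documentation; clone and ID names are illustrative:

property stonith-enabled=false
primitive haproxy lsb:haproxy
primitive ec2setip lsb:ec2setip
primitive nginx lsb:nginx
primitive chefclient lsb:chef
# haproxy and the EC2 IP takeover live together on exactly one node
group g_proxy haproxy ec2setip
# nginx is cloned across the cluster...
clone cl_nginx nginx
# ...but never on the node holding the proxy group
colocation nginx_not_with_proxy -inf: cl_nginx g_proxy
# chef-client runs everywhere
clone cl_chef chefclient

As for the import command: "crm configure" on its own opens an interactive configure session rather than reading a file, so to load the file non-interactively something like "crm configure load update /tmp/proxyfailover.txt" or "crm -f /tmp/proxyfailover.txt" is the usual approach, depending on the crm shell version in use.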