Re: [Pacemaker] FW: time pressure - software raid cluster, raid1 resource agent, help needed

2011-03-06 Thread Holger Teutsch
On Sun, 2011-03-06 at 12:40 +0100, patrik.rappo...@knapp.com wrote:
Hi,
I assume the basic problem is in your RAID configuration.

If you unmap one box, the devices should not be in status FAILED but only
degraded.

So what is the exit status of

mdadm --detail --test /dev/md0

after unmapping?
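For reference, I would check it on the active node roughly like this (the exit
code meanings below are from my reading of mdadm(8), so please verify them
against your mdadm version):

mdadm --detail --test /dev/md0; echo "exit status: $?"
# 0 - the array is functioning normally
# 1 - the array is degraded (at least one failed or missing device)
# 2 - the array has multiple failed devices and is unusable
# 4 - there was an error while trying to get information about the array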

Furthermore, I would start with one isolated group containing the
RAID, LVM, and FS to keep it simple.
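Something like this, just as a sketch (the mdadm.conf path, md device, volume
group, mount point, and filesystem type are placeholders you would replace with
your own values):

primitive p_raid1 ocf:heartbeat:Raid1 \
        params raidconf="/etc/mdadm.conf" raiddev="/dev/md0" \
        op monitor interval="30s"
primitive p_lvm ocf:heartbeat:LVM \
        params volgrpname="vg_data" \
        op monitor interval="30s"
primitive p_fs ocf:heartbeat:Filesystem \
        params device="/dev/vg_data/lv_data" directory="/data" fstype="ext3" \
        op monitor interval="30s"
group g_storage p_raid1 p_lvm p_fs

Once this single group behaves correctly during your unmap test, you can add
the other RAIDs and resources again.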

Regards
Holger

 Hi,
 
 does anyone have an idea about this? I only have the servers until
 next Friday, so to my regret I am under time pressure :(
 
 As I already wrote, I would appreciate and test any of your ideas.
 Also, if someone has already built clusters with lvm-mirror, I would be
 happy to get a CIB or some configuration examples.
 
 Thank you very much in advance.
 
 kr Patrik
 
 
 patrik.rappo...@knapp.com
 03.03.2011 15:11
 Please reply to: The Pacemaker cluster resource manager
 
 To: pacemaker@oss.clusterlabs.org
 Cc:
 Bcc:
 Subject: [Pacemaker] software raid cluster, raid1 resource agent, help
 needed
 
 
 Good day,
 
 I have a 2-node active/passive cluster which is connected to two IBM
 4700 storages. I configured 3 RAIDs and I use the Raid1 resource
 agent for managing the RAID1s in the cluster.
 When I disable the mapping of one storage, to simulate the failure of
 one storage, the Raid1 resources change to the state FAILED and the
 second node then takes over the resources and is able to start the
 RAID devices.
 
 So I am confused as to why the active node can't keep the Raid1
 resources while the former passive node takes them over and can start
 them correctly.
 
 I would really appreciate your advice, or maybe someone already has an
 example configuration for Raid1 with two storages.
 
 Thank you very much in advance. Attached you can find my cib.xml. 
 
 kr Patrik 
 
 
 
 Mit freundlichen Grüßen / Best Regards
 
 Patrik Rapposch, BSc
 System Administration
 
 KNAPP Systemintegration GmbH
 Waltenbachstraße 9
 8700 Leoben, Austria 
 Phone: +43 3842 805-915
 Fax: +43 3842 805-500
 patrik.rappo...@knapp.com 
 www.KNAPP.com 
 
 Commercial register number: FN 138870x
 Commercial register court: Leoben
 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Failure after intermittent network outage

2011-03-06 Thread Pavel Levshin

Hi everyone.

We have a three-node cluster running Pacemaker 1.0.10 on RHEL 5.5. Two nodes 
(wapgw1-1 and wapgw1-2) are configured for DRBD and run virtual 
machines on top of it. The third node (wapgw1-log) is mostly a quorum server, 
i.e. it has neither libvirtd nor DRBD installed. There are location 
constraints which allow resources to run on the real nodes only.
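For illustration, the constraints are of the usual crm form; the resource
names below are placeholders rather than our real ones:

location drbd-not-on-log ms_drbd_example -inf: wapgw1-log
location vm-not-on-log vm_example -inf: wapgw1-log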


All three nodes are connected to the network over bonded links in 
active-backup mode.


STONITH had been configured but was unavailable at the moment. It's bad, I know.

The problem came when one of the two interfaces on the quorum node (wapgw1-log) 
went down. This was not the first time, and previously it had not caused any 
harm.


Corosync lost connectivity and the cluster split into partitions.

Mar  1 11:15:58 wapgw1-log corosync[24536]:   [TOTEM ] A processor 
failed, forming new configuration.
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [pcmk  ] notice: 
pcmk_peer_update: Transitional membership event on ring 3500: memb=1, 
new=0, lost=2
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [pcmk  ] info: 
pcmk_peer_update: memb: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [pcmk  ] info: 
pcmk_peer_update: lost: wapgw1-1 1098666762
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [pcmk  ] info: 
pcmk_peer_update: lost: wapgw1-2 1115443978
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [pcmk  ] notice: 
pcmk_peer_update: Stable membership event on ring 3500: memb=1, new=0, 
lost=0
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [pcmk  ] info: 
pcmk_peer_update: MEMB: wapgw1-log 813454090
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [pcmk  ] info: 
ais_mark_unseen_peer_dead: Node wapgw1-2 was not seen in the previous 
transition
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [pcmk  ] info: 
update_member: Node 1115443978/wapgw1-2 is now: lost
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [pcmk  ] info: 
ais_mark_unseen_peer_dead: Node wapgw1-1 was not seen in the previous 
transition
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [pcmk  ] info: 
update_member: Node 1098666762/wapgw1-1 is now: lost
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [pcmk  ] info: 
send_member_notification: Sending membership update 3500 to 2 children
Mar  1 11:15:59 wapgw1-log crmd: [24547]: notice: ais_dispatch: 
Membership 3500: quorum lost
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [TOTEM ] A processor 
joined or left the membership and a new membership was formed.
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node 
wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66)  
votes=1 born=3400 seen=3496 proc=00013312
Mar  1 11:15:59 wapgw1-log cib: [24543]: notice: ais_dispatch: 
Membership 3500: quorum lost
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: crm_update_peer: Node 
wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65)  
votes=1 born=3404 seen=3496 proc=00013312
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node 
wapgw1-2: id=1115443978 state=lost (new) addr=r(0) ip(10.83.124.66)  
votes=1 born=3400 seen=3496 proc=00013312
Mar  1 11:15:59 wapgw1-log crmd: [24547]: WARN: check_dead_member: Our 
DC node (wapgw1-2) left the cluster
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: 
State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION 
cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: crm_update_peer: Node 
wapgw1-1: id=1098666762 state=lost (new) addr=r(0) ip(10.83.124.65)  
votes=1 born=3404 seen=3496 proc=00013312

Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: update_dc: Unset DC wapgw1-2
Mar  1 11:15:59 wapgw1-log corosync[24536]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_state_transition: 
State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC 
cause=C_FSA_INTERNAL origin=do_election_check ]
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_te_control: 
Registering TE UUID: 1be865f6-557d-45c4-b549-c10dbab5acc4
Mar  1 11:15:59 wapgw1-log crmd: [24547]: WARN: 
cib_client_add_notify_callback: Callback already present
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: set_graph_functions: 
Setting custom graph functions
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: unpack_graph: Unpacked 
transition -1: 0 actions in 0 synapses
Mar  1 11:15:59 wapgw1-log crmd: [24547]: info: do_dc_takeover: Taking 
over DC status for this partition
Mar  1 11:15:59 wapgw1-log cib: [24543]: info: cib_process_readwrite: We 
are now in R/W mode


The DC node noticed the member loss:

Mar  1 11:15:59 wapgw1-2 pengine: [5748]: WARN: pe_fence_node: Node 
wapgw1-log will be fenced because it is un-expectedly down
Mar  1 11:15:59 wapgw1-2 pengine: [5748]: info: 
determine_online_status_fencing: ha_state=active, ccm_state=false, 
crm_state=online, join_state=member, 

[Pacemaker] Help with batch import and resource distribution

2011-03-06 Thread Todd Nine
Hi all,
  I'm creating my Pacemaker configuration from a script executed by Chef, and
I'm having some issues. I have 3 init scripts that run the following
services:

haproxy
nginx
ec2setip
chef-client


I would like the following distribution.

Single node: haproxy and ec2setip

All other nodes: nginx

All nodes: chef-client


Essentially, I use HAProxy for load balancing and nginx for SSL
decryption and serving static pages, so I want nginx to run on every node
that isn't the HAProxy node. During a failover, I want haproxy to be
started and ec2setip to be run on a single node, and all other nodes to
start nginx. I'm not using STONITH on purpose: if one node takes over the
IP while another is still running, it does not affect my service, since
none of my clustered services performs any data writes. I'm using the
following configuration and importing it with this command:


crm configure < /tmp/proxyfailover.txt

Here is the content of /tmp/proxyfailover.txt:


---- BEGIN FILE ----
property stonith-enabled=false
primitive haproxy lsb:haproxy
primitive nginx lsb:nginx
primitive ec2setip lsb:ec2setip
primitive chefclient lsb:chef

order nginx-after-haproxy inf: haproxy nginx
order ec2setip-after-nginx inf: nginx ec2setip
order chefclient-after-ec2setip inf: ec2setip chefclient


commit

---- END FILE ----

I've read this section, but I'm a bit lost.  Any help would be greatly
appreciated.

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-resource-sets-collocation.html#id580996
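
Here is roughly the direction I was considering after reading that, although
I'm not sure it is correct (the clone and constraint names below are just
made up):

clone cl_nginx nginx
clone cl_chefclient chefclient
colocation ec2setip-with-haproxy inf: ec2setip haproxy
colocation nginx-not-with-haproxy -inf: cl_nginx haproxy
order ec2setip-after-haproxy inf: haproxy ec2setip

Does that look like a sane approach, or should I be using resource sets
instead?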

Thanks,
Todd
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker