Re: [Pacemaker] Quorum disk?
2010/8/25 Michael Schwartzkopff :
> On Wednesday, 25.08.2010, at 17:01 -0400, Ciro Iriarte wrote:
>> Hi, I'm planning to use OpenAIS+Pacemaker on SLES11-SP1 and would like
>> to know if it's possible to use a quorum disk in a two-node cluster.
>> The idea is to avoid adding a third node just for quorum...
>>
>> Regards,
>
> Hi,
>
> you could have a look at the sfex resource agent.
>
> Greetings,
>
> Michael Schwartzkopff

Thanks, sounds interesting, but it doesn't modify the quorum count.

Regards,
--
Ciro Iriarte
http://cyruspy.wordpress.com

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
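Michael's sfex suggestion can be sketched roughly as below. This is a hedged illustration, not a confirmed recipe for the poster's setup: the device path and index are placeholders, and, as Ciro notes, sfex provides exclusive disk locking rather than an extra quorum vote, so a two-node cluster would typically also relax the quorum policy.

```shell
# Illustrative sketch only: device path and index are hypothetical.
# sfex uses a small shared-disk partition as an exclusive lock; it does
# NOT add a quorum vote, so a two-node cluster usually also needs:
crm configure property no-quorum-policy="ignore"

# Lock a shared LUN partition with the sfex resource agent:
crm configure primitive sfex-lock ocf:heartbeat:sfex \
    params device="/dev/disk/by-id/shared-lun-part1" index="1" \
    op monitor interval="10s"
```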
[Pacemaker] cib fails to start until host is rebooted
Hi,

I have a pacemaker/corosync setup on a bunch of fully patched SLES11 SP1 systems. On one of the systems, if I /etc/init.d/openais stop, then /etc/init.d/openais start, pacemaker fails to come up:

Aug 30 15:48:09 xen-test1 cib: [5858]: info: crm_cluster_connect: Connecting to OpenAIS
Aug 30 15:48:09 xen-test1 cib: [5858]: info: init_ais_connection: Creating connection to our AIS plugin
Aug 30 15:48:10 xen-test1 corosync[5851]: [IPC ] Invalid IPC credentials.
Aug 30 15:48:10 xen-test1 cib: [5858]: info: init_ais_connection: Connection to our AIS plugin (9) failed: unknown (100)
Aug 30 15:48:10 xen-test1 cib: [5858]: CRIT: cib_init: Cannot sign in to the cluster... terminating

I've tried rm /var/run/crm/*, but it doesn't help; the only fix is to reboot. I have an strace -f of /etc/init.d/openais start, if that would help.

cluster-glue-1.0.5-0.5.1
corosync-1.2.1-0.5.1
libpacemaker3-1.1.2-0.2.1
libcorosync4-1.2.1-0.5.1
libopenais3-1.1.2-0.5.19
pacemaker-1.1.2-0.2.1
openais-1.1.2-0.5.19

Thanks, Mike
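One thing worth checking in this situation (an assumption based on how corosync 1.x IPC works, not a confirmed diagnosis for this report): corosync keeps its client IPC state in POSIX shared memory, and stale segments left behind by a crashed or half-stopped instance can prevent clients from reconnecting until they are cleaned up.

```shell
# Hedged troubleshooting sketch, not a verified fix.
/etc/init.d/openais stop
ps -ef | grep -E 'corosync|aisexec'   # make sure no old instance survived
ls -l /dev/shm                        # look for leftover corosync IPC buffers
# remove only files clearly left by the dead corosync instance, then:
/etc/init.d/openais start
```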
Re: [Pacemaker] Adding a STONITH module to the distribution
William Seligman writes:
> William Seligman writes:
>
>> I've written a STONITH device script for systems that monitor their UPSes
>> w/NUT. I think it might be of sufficient interest to include in the standard
>> Pacemaker distribution. What is the procedure for submitting such scripts?
>>
>> I don't particularly want credit or anything like that. It's just a simple
>> script that I think could be a time-saver for sysadmins like me.
>
> Here's a link to the script. (I would have posted the link to the script
> directly, but it has lines longer than 80 characters, and the web interface to
> GMANE is giving me some flak.)

What I meant to say was "I would have posted the script directly..."

> http://bit.ly/3yPjS

Oh, frak. bit.ly created a link to my home page, which is of interest to no one. Let me try that again:

http://bit.ly/annpi1

> If the comment near the top of the file is not sufficient to put the script
> under the GPL license, please let me know.
Re: [Pacemaker] Adding a STONITH module to the distribution
William Seligman writes:
> I've written a STONITH device script for systems that monitor their UPSes
> using NUT. I think it might be of sufficient interest to include in the
> standard Pacemaker distribution. What is the procedure for submitting such
> scripts?
>
> I don't particularly want credit or anything like that. It's just a simple
> script that I think could be a time-saver for sysadmins like me.

Here's a link to the script. (I would have posted the link to the script directly, but it has lines longer than 80 characters, and the web interface to GMANE is giving me some flak.)

http://bit.ly/3yPjS

If the comment near the top of the file is not sufficient to put the script under the GPL license, please let me know.
Re: [Pacemaker] cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012
Ok, I'll do that. Thanks!

On 08/30/2010 11:16 AM, Dan Frincu wrote:
> Try using RSTP on the switches, if possible, it has a lower
> convergence time.
>
> Roberto Giordani wrote:
>> Thanks,
>> who should I contact? Which mailing list?
>> I've discovered that this problem occurs when the port of my switch
>> where the cluster ring is connected became "blocked" due to spanning tree.
>> I've resolved the problem by using a separate switch for the ring, without
>> spanning tree enabled and on a different subnet.
>> Is there a configuration to avoid the cluster nodes hanging before the
>> spanning tree recalculates the route after a failure?
>> The hang occurs on SLES11sp1 too: the servers are up and running and the
>> cluster status is ok, but when I try to connect to a server with ssh,
>> the session hangs after login.
>>
>> Usually the recalculation takes 50 seconds.
>>
>> Regards,
>> Roberto.
>>
>> On 08/26/2010 10:24 AM, Dejan Muhamedagic wrote:
>>
>>> Hi,
>>>
>>> On Thu, Aug 26, 2010 at 09:36:10AM +0200, Andrew Beekhof wrote:
>>>
>>> On Wed, Aug 18, 2010 at 6:24 PM, Roberto Giordani wrote:
> Hello,
> I'll explain what's happened after a network black-out
> I've a cluster with pacemaker on Opensuse 11.2 64bit
>
> Last updated: Wed Aug 18 18:13:33 2010
> Current DC: nodo1 (nodo1)
> Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
> 3 Nodes configured.
> 11 Resources configured.
> > > Node: nodo1 (nodo1): online > Node: nodo3 (nodo3): online > Node: nodo4 (nodo4): online > > Clone Set: dlm-clone > dlm:0 (ocf::pacemaker:controld): Started nodo3 > dlm:1 (ocf::pacemaker:controld): Started nodo1 > dlm:2 (ocf::pacemaker:controld): Started nodo4 > Clone Set: o2cb-clone > o2cb:0 (ocf::ocfs2:o2cb): Started nodo3 > o2cb:1 (ocf::ocfs2:o2cb): Started nodo1 > o2cb:2 (ocf::ocfs2:o2cb): Started nodo4 > Clone Set: XencfgFS-Clone > XencfgFS:0 (ocf::heartbeat:Filesystem):Started nodo3 > XencfgFS:1 (ocf::heartbeat:Filesystem):Started nodo1 > XencfgFS:2 (ocf::heartbeat:Filesystem):Started nodo4 > Clone Set: XenimageFS-Clone > XenimageFS:0(ocf::heartbeat:Filesystem):Started nodo3 > XenimageFS:1(ocf::heartbeat:Filesystem):Started nodo1 > XenimageFS:2(ocf::heartbeat:Filesystem):Started nodo4 > rsa1-fencing(stonith:external/ibmrsa-telnet): Started nodo4 > rsa2-fencing(stonith:external/ibmrsa-telnet): Started nodo3 > rsa3-fencing(stonith:external/ibmrsa-telnet): Started nodo4 > rsa4-fencing(stonith:external/ibmrsa-telnet): Started nodo3 > mailsrv-rm (ocf::heartbeat:Xen): Started nodo3 > dbsrv-rm(ocf::heartbeat:Xen): Started nodo4 > websrv-rm (ocf::heartbeat:Xen): Started nodo4 > > After a switch failure all the nodes and the rsa stonith devices was > unreachable. > > On the cluster happen the following error on one node > > Aug 18 13:11:38 nodo1 cluster-dlm: receive_plocks_stored: > receive_plocks_stored 1778493632:2 need_plocks 0#012 > > Aug 18 13:11:38 nodo1 kernel: [ 4154.272025] [ cut here > ] > > Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at > /usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323! 
> > Aug 18 13:11:38 nodo1 kernel: [ 4154.272042] invalid opcode: [#1] SMP > > Aug 18 13:11:38 nodo1 kernel: [ 4154.272046] last sysfs file: > /sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control > > Aug 18 13:11:38 nodo1 kernel: [ 4154.272050] CPU 1 > > Aug 18 13:11:38 nodo1 kernel: [ 4154.272053] Modules linked in: > nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev > iptable_filter ip_tables x_tables ocfs2 ocfs2_nodemanager quota_tree > ocfs2_stack_user ocfs2_stackglue dlm configfs netbk coretemp blkbk > blkback_pagemap blktap xenbus_be ipmi_si edd dm_round_robin scsi_dh_rdac > dm_multipath scsi_dh bridge stp llc bonding ipv6 fuse ext4 jbd2 crc16 loop > dm_mod sr_mod ide_pci_generic ide_core iTCO_wdt ata_generic ibmpex i5k_amb > ibmaem iTCO_vendor_support ipmi_msghandler bnx2 i5000_edac 8250_pnp shpchp > ata_piix pcspkr ics932s401 joydev edac_core i2c_i801 ses pci_hotplug 8250 > i2c_core serio_raw enclosure serial_core button sg reiserfs usbhid hid > uhci_hcd ehci_hcd xenblk cdrom xennet fan processor pata_acpi lpfc thermal > thermal_sys hwmon aacraid [last unloaded: ocfs2_stackglue] > > Aug 18 13:11:38 nodo1 kernel: [ 4154.272111] Pid: 8889, comm: dlm_send Not > tainted 2.6.31.12-0.2-xen #1 IBM System x3650 -[7979AC1]- > > Aug 18 13:11:38 nodo1 kernel: [ 4154.2
Re: [Pacemaker] Howto upgrade Pacemaker cluster from Version: 1.0.2 to the last released on clusterlabs
Thanks!

On 08/30/2010 11:15 AM, Andrew Beekhof wrote:
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-upgrade-config.html
>
> On Sat, Aug 28, 2010 at 9:34 AM, Roberto Giordani wrote:
>
>> Hello,
>> but how do I migrate the entire cluster configuration (resources, nodes,
>> stonith)?
>> Regards,
>> Roberto.
>>
>> On 08/26/2010 09:40 AM, Andrew Beekhof wrote:
>>
>>> On Wed, Aug 18, 2010 at 11:15 PM, Roberto Giordani
>>> wrote:
>>>
>>>> Hello,
>>>> I'd like to know how it is possible to upgrade a running pacemaker cluster
>>>> on Opensuse 11.2 from version 1.0.2 to the latest available on clusterlabs,
>>>> using dlm + ocfs2 too.
>>>
>>> The problem is that the versions of pacemaker on clusterlabs are
>>> probably incompatible with your existing dlm and ocfs2 packages.
>>> You'd need to rebuild them against the new pacemaker packages.
>>>
>>>> Could someone explain in a few steps how to proceed without losing the
>>>> whole cluster configuration up and running?
>>>
>>> Assuming you have a compatible set of new packages (see above), just
>>> do a rolling upgrade.
Re: [Pacemaker] Two cloned VM, only one of the both shows online when starting corosync/pacemaker
On 27/08/2010 16:29, Andrew Beekhof wrote:
> On Tue, Aug 3, 2010 at 4:40 PM, Guillaume Chanaud wrote:
>> Hello,
>> sorry for the delay it took, july is not the best month to get things
>> working fast.
> Neither is august :-)
lol, sure :)
>> Here is the core dump file (55MB) :
>> http://www.connecting-nature.com/corosync/core
>> corosync version is 1.2.3
> Sorry, but I can't do anything with that file.
> Core files are only usable on the machine they came from.
> you'll have to open it with gdb and type "bt" to get a backtrace.

Sorry, I saw that after sending the last mail. In fact I tried to debug/bt it, but:
1. I'm not a C developer (I understand a little about it...)
2. I never used gdb before, so it's hard to step into the corosync debug

I'm not sure the trace will be useful, but here it is:

Core was generated by `corosync'.
Program terminated with signal 6, Aborted.
#0  0x003506a329a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x003506a329a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x003506a34185 in abort () at abort.c:92
#2  0x003506a2b935 in __assert_fail (assertion=0x7fce14f0b2ae "token_memb_entries >= 1", file=<optimized out>, line=1194, function=<optimized out>) at assert.c:81
#3  0x7fce14efb716 in memb_consensus_agreed (instance=0x7fce12338010) at totemsrp.c:1194
#4  0x7fce14f01723 in memb_join_process (instance=0x7fce12338010, memb_join=0x822bf8) at totemsrp.c:3922
#5  0x7fce14f01a3a in message_handler_memb_join (instance=0x7fce12338010, msg=<optimized out>, msg_len=<optimized out>, endian_conversion_needed=<optimized out>) at totemsrp.c:4165
#6  0x7fce14ef7644 in rrp_deliver_fn (context=<optimized out>, msg=0x822bf8, msg_len=420) at totemrrp.c:1404
#7  0x7fce14ef6569 in net_deliver_fn (handle=<optimized out>, fd=<optimized out>, revents=<optimized out>, data=0x822550) at totemudp.c:1244
#8  0x7fce14ef259a in poll_run (handle=2240235047305084928) at coropoll.c:435
#9  0x00405594 in main (argc=<optimized out>, argv=<optimized out>) at main.c:1558

I tried to compile it from source (the 1.2.7 tag and svn trunk), but I'm unable to backtrace it, as gdb tells me it can't find debug info (I did a ./configure --enable-debug, but gdb seems to need a /usr/lib/debug/.build-id/... file related to the current executable, and I don't know how to generate this).

On the 1.2.7 version, the init script says it started correctly, but after one or two seconds only the lrmd and pengine processes are still alive. On the trunk version, the init script fails to start (and so the processes are correctly killed).

In 1.2.7, when I'm stepping, I'm unable to go further than service.c:201 res = service->exec_init_fn (corosync_api); as it should create a new process for the pacemaker services, I think (I don't know how to step inside this new process and debug it).

If you need/want, I'll let you access this VM via ssh to test/debug it.

It should be related to other posts about "Could not connect to the CIB service: connection failed" (I saw some messages related to things more or less like my problem).

I put the end of the messages log back here:

Aug 30 16:30:50 www01 crmd: [19821]: notice: ais_dispatch: Membership 208656: quorum acquired
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node www01.connecting-nature.com: id=1006676160 state=member (new) addr=r(0) ip(192.168.0.60) (
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node now has id: 83929280
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null): id=83929280 state=member (new) addr=r(0) ip(192.168.0.5) votes=0 born=0 seen=20865
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node filer2.connecting-nature.com now has id: 100706496
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node 100706496 is now known as filer2.connecting-nature.com
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node filer2.connecting-nature.com: id=100706496 state=member (new) addr=r(0) ip(192.168.0.6) vo
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node now has id: 1174448320
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null): id=1174448320 state=member (new) addr=r(0) ip(192.168.0.70) votes=0 born=0 seen=20
Aug 30 16:30:50 www01 crmd: [19821]: info: do_started: The local CRM is operational
Aug 30 16:30:50 www01 crmd: [19821]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_st
Aug 30 16:30:50 www01 corosync[19809]: [TOTEM ] FAILED TO RECEIVE
Aug 30 16:30:51 www01 crmd: [19821]: info: ais_dispatch: Membership 208656: quorum retained
Aug 30 16:30:51 www01 crmd: [19821]: info: te_connect_stonith: Attempting connection to fencing daemon...
Aug 30 16:30:52 www01 crmd: [19821]: info: te_connect_stonith: Connected
Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Aug 30 16:30:52 www01
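For the debug-info problem described above, a build made with symbols plus gdb pointed at the matching binary and core usually gives a full backtrace without needing the /usr/lib/debug/.build-id layout. A sketch, with illustrative paths (the core file location depends on the system's core_pattern):

```shell
# Sketch: build corosync with debug symbols and no optimisation,
# then inspect a core dump with the matching binary.
./configure --enable-debug CFLAGS="-g -O0"
make && make install

ulimit -c unlimited                    # allow core dumps before reproducing
# after the crash (core path is illustrative):
gdb /usr/sbin/corosync ./core
# at the (gdb) prompt:
#   thread apply all bt full           # backtrace of every thread
```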
Re: [Pacemaker] clmvd hangs on node1 if node2 is fenced
Michael Smith writes:
> I've got a pair of fully patched SLES11 SP1 nodes and they're showing
> what I guess is the same behaviour: if I hard-poweroff node2, operations
> like "vgdisplay -v" hang on node1 for quite some time. Sometimes a
> minute, sometimes two, sometimes forever. They get stuck here:

Hi Michael,

the bug is fixed in the devel package released after SP1 - and yes, you need STONITH for it to work reliably ;)

Kind regards,
Rainer
[Pacemaker] DRBD Outdated by Heartbeat/Pacemaker - node alive don't get Primary
Hi pacemaker group,

I am using Debian 5.0.5 Lenny, DRBD 8.3.7, Heartbeat 3.0.3 (backports), pacemaker 1.0.9 (backports).

I have a problem when putting nodes in standby mode, or when shutting down one node: when one node is offline or in standby (crm node standby), the other one goes slave and DRBD becomes Secondary / Outdated:

# crm_mon
Last updated: Mon Aug 30 11:50:45 2010
Stack: Heartbeat
Current DC: swmaster1 (2cd4bf30-7a63-4da7-9102-b4f49d91b9d0) - partition with quorum
Version: 1.0.9-unknown
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ swmaster1 ]
OFFLINE: [ swslave1 ]

Master/Slave Set: ms_drbd_mysql
    Slaves: [ swmaster1 ]
    Stopped: [ drbd_mysql:0 ]
_________

SWMaster1:~# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91) built-in
1: cs:WFConnection ro:Secondary/Unknown ds:Outdated/DUnknown C r
   ns:1104 nr:744 dw:1944 dr:67439479 al:44 bm:67 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:64
_________

When both nodes are online, everything is ok, and I can switch resources using 'crm resource migrate grp_mysql':

Last updated: Mon Aug 30 11:57:09 2010
Stack: Heartbeat
Current DC: swmaster1 (2cd4bf30-7a63-4da7-9102-b4f49d91b9d0) - partition with quorum
Version: 1.0.9-unknown
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ swslave1 swmaster1 ]

Master/Slave Set: ms_drbd_mysql
    Masters: [ swmaster1 ]
    Slaves: [ swslave1 ]
Resource Group: grp_mysql
    fs_mysql (ocf::heartbeat:Filesystem): Started swmaster1
    mysqld (lsb:mysql): Started swmaster1
_________

Reconnecting... SWMaster1:~# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91) built-in
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r
   ns:1520 nr:1136 dw:2704 dr:67449610 al:50 bm:79 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
_________

Instead of having an HA infrastructure, I have an LA one :). When I use DRBD manually with heartbeat shut down (/etc/init.d/heartbeat stop), I can stop DRBD on one side and the other node stays UpToDate, so I can make it primary (drbdadm primary all).

How can I make Heartbeat/Pacemaker understand that it should not put DRBD in the Outdated state, and instead move the services/resources to the other node?

Here are my configurations:

SWMaster1:~# crm configure show
node $id="2cd4bf30-7a63-4da7-9102-b4f49d91b9d0" swmaster1 \
    attributes standby="off"
node $id="e022eabd-ef7b-4049-b941-fc26d00c5cd1" swslave1 \
    attributes standby="off"
primitive drbd_mysql ocf:linbit:drbd \
    params drbd_resource="mysql" \
    op monitor interval="15s"
primitive fs_mysql ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/mysql" directory="/var/lib/mysql" fstype="ext3"
primitive mysqld lsb:mysql
group grp_mysql fs_mysql mysqld
ms ms_drbd_mysql drbd_mysql \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location cli-prefer-mysqld mysqld \
    rule $id="cli-prefer-rule-mysqld" inf: #uname eq swmaster1
location cli-standby-grp_mysql grp_mysql \
    rule $id="cli-standby-rule-grp_mysql" -inf: #uname eq swslave1
colocation mysql_on_drbd inf: grp_mysql ms_drbd_mysql:Master
order mysql_after_drbd inf: ms_drbd_mysql:promote grp_mysql:start
property $id="cib-bootstrap-options" \
    dc-version="1.0.9-unknown" \
    cluster-infrastructure="Heartbeat" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"
_________

SWMaster1:~# cat /etc/ha.d/ha.cf
use_logd on
autojoin none
node SWMaster1
node SWSlave1
crm yes
compression bz2
warntime 10
deadtime 40
initdead 60
msgfmt netstring
ucast eth0 ip.serv.mas.ter
ucast eth0 ip.serv.sla.ve
_________

SWMaster1:~# cat /etc/drbd.conf
global { usage-count yes; }
common {
    protocol C;
    syncer {
        # algorithm to use; enables on-line verification - drbdadm verify [resource|all]
        verify-alg sha1;
        # checksum-based block comparison to check whether a write is needed
        csums-alg sha1;
        # synchronisation speed - drbdsetup /dev/drbdnum syncer -r 10M
        rate 7M;
    }
    disk {
        on-io-error detach;
    }
    net {
        # http://www.drbd.org/users-guide-emb/s-integrity-check.html
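A hedged observation on the configuration shown, offered as a possibility rather than a confirmed diagnosis: it still contains cli-prefer-*/cli-standby-* location rules, which 'crm resource migrate' leaves behind until they are explicitly cleared, and the -inf rule for swslave1 would keep grp_mysql off that node even when it is the only one left. Clearing them can be sketched as:

```shell
# Sketch: remove leftover migration constraints (resource/constraint names
# taken from the configuration above; verify with "crm configure show" first).
crm resource unmigrate grp_mysql        # drops the cli-* constraints for the group
# or delete the rules directly:
crm configure delete cli-prefer-mysqld
crm configure delete cli-standby-grp_mysql
```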
Re: [Pacemaker] cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012
Try using RSTP on the switches, if possible, it has a lower convergence time.

Roberto Giordani wrote:

Thanks,
who should I contact? Which mailing list?

I've discovered that this problem occurs when the port of my switch where the cluster ring is connected became "blocked" due to spanning tree. I've resolved the problem by using a separate switch for the ring, without spanning tree enabled and on a different subnet.

Is there a configuration to avoid the cluster nodes hanging before the spanning tree recalculates the route after a failure? The hang occurs on SLES11sp1 too: the servers are up and running and the cluster status is ok, but when I try to connect to a server with ssh, the session hangs after login.

Usually the recalculation takes 50 seconds.

Regards,
Roberto.

On 08/26/2010 10:24 AM, Dejan Muhamedagic wrote:

Hi,

On Thu, Aug 26, 2010 at 09:36:10AM +0200, Andrew Beekhof wrote:

On Wed, Aug 18, 2010 at 6:24 PM, Roberto Giordani wrote:

Hello,
I'll explain what's happened after a network black-out
I've a cluster with pacemaker on Opensuse 11.2 64bit

Last updated: Wed Aug 18 18:13:33 2010
Current DC: nodo1 (nodo1)
Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
3 Nodes configured.
11 Resources configured.
Node: nodo1 (nodo1): online
Node: nodo3 (nodo3): online
Node: nodo4 (nodo4): online

Clone Set: dlm-clone
    dlm:0 (ocf::pacemaker:controld): Started nodo3
    dlm:1 (ocf::pacemaker:controld): Started nodo1
    dlm:2 (ocf::pacemaker:controld): Started nodo4
Clone Set: o2cb-clone
    o2cb:0 (ocf::ocfs2:o2cb): Started nodo3
    o2cb:1 (ocf::ocfs2:o2cb): Started nodo1
    o2cb:2 (ocf::ocfs2:o2cb): Started nodo4
Clone Set: XencfgFS-Clone
    XencfgFS:0 (ocf::heartbeat:Filesystem): Started nodo3
    XencfgFS:1 (ocf::heartbeat:Filesystem): Started nodo1
    XencfgFS:2 (ocf::heartbeat:Filesystem): Started nodo4
Clone Set: XenimageFS-Clone
    XenimageFS:0 (ocf::heartbeat:Filesystem): Started nodo3
    XenimageFS:1 (ocf::heartbeat:Filesystem): Started nodo1
    XenimageFS:2 (ocf::heartbeat:Filesystem): Started nodo4
rsa1-fencing (stonith:external/ibmrsa-telnet): Started nodo4
rsa2-fencing (stonith:external/ibmrsa-telnet): Started nodo3
rsa3-fencing (stonith:external/ibmrsa-telnet): Started nodo4
rsa4-fencing (stonith:external/ibmrsa-telnet): Started nodo3
mailsrv-rm (ocf::heartbeat:Xen): Started nodo3
dbsrv-rm (ocf::heartbeat:Xen): Started nodo4
websrv-rm (ocf::heartbeat:Xen): Started nodo4

After the switch failure, all the nodes and the RSA stonith devices were unreachable.

The following error occurred on one node of the cluster:

Aug 18 13:11:38 nodo1 cluster-dlm: receive_plocks_stored: receive_plocks_stored 1778493632:2 need_plocks 0#012
Aug 18 13:11:38 nodo1 kernel: [ 4154.272025] [ cut here ]
Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at /usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323!
Aug 18 13:11:38 nodo1 kernel: [ 4154.272042] invalid opcode: [#1] SMP Aug 18 13:11:38 nodo1 kernel: [ 4154.272046] last sysfs file: /sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control Aug 18 13:11:38 nodo1 kernel: [ 4154.272050] CPU 1 Aug 18 13:11:38 nodo1 kernel: [ 4154.272053] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev iptable_filter ip_tables x_tables ocfs2 ocfs2_nodemanager quota_tree ocfs2_stack_user ocfs2_stackglue dlm configfs netbk coretemp blkbk blkback_pagemap blktap xenbus_be ipmi_si edd dm_round_robin scsi_dh_rdac dm_multipath scsi_dh bridge stp llc bonding ipv6 fuse ext4 jbd2 crc16 loop dm_mod sr_mod ide_pci_generic ide_core iTCO_wdt ata_generic ibmpex i5k_amb ibmaem iTCO_vendor_support ipmi_msghandler bnx2 i5000_edac 8250_pnp shpchp ata_piix pcspkr ics932s401 joydev edac_core i2c_i801 ses pci_hotplug 8250 i2c_core serio_raw enclosure serial_core button sg reiserfs usbhid hid uhci_hcd ehci_hcd xenblk cdrom xennet fan processor pata_acpi lpfc thermal thermal_sys hwmon aacraid [last unloaded: ocfs2_stackglue] Aug 18 13:11:38 nodo1 kernel: [ 4154.272111] Pid: 8889, comm: dlm_send Not tainted 2.6.31.12-0.2-xen #1 IBM System x3650 -[7979AC1]- Aug 18 13:11:38 nodo1 kernel: [ 4154.272113] RIP: e030:[] [] iput+0x82/0x90 Aug 18 13:11:38 nodo1 kernel: [ 4154.272121] RSP: e02b:88014ec03c30 EFLAGS: 00010246 Aug 18 13:11:38 nodo1 kernel: [ 4154.272122] RAX: RBX: 880148a703c8 RCX: Aug 18 13:11:38 nodo1 kernel: [ 4154.272123] RDX: c901 RSI: 880148a70380 RDI: 880148a703c8 Aug 18 13:11:38 nodo1 kernel: [ 4154.272125] RBP: 88014ec03c50 R08: b038 R09: fe99594c51a57607 Aug 18 13:11:38 nodo1 kernel: [ 4154.272126] R10: 880040410270 R11: R12: 8801713e6e08 Aug 18 13:11:38 nodo1 kernel:
Re: [Pacemaker] Howto upgrade Pacemaker cluster from Version: 1.0.2 to the last released on clusterlabs
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-upgrade-config.html

On Sat, Aug 28, 2010 at 9:34 AM, Roberto Giordani wrote:
> Hello,
> but how do I migrate the entire cluster configuration (resources, nodes,
> stonith)?
> Regards,
> Roberto.
>
> On 08/26/2010 09:40 AM, Andrew Beekhof wrote:
>> On Wed, Aug 18, 2010 at 11:15 PM, Roberto Giordani
>> wrote:
>>
>>> Hello,
>>> I'd like to know how it is possible to upgrade a running pacemaker cluster
>>> on Opensuse 11.2 from version 1.0.2 to the latest available on clusterlabs,
>>> using dlm + ocfs2 too.
>>
>> The problem is that the versions of pacemaker on clusterlabs are
>> probably incompatible with your existing dlm and ocfs2 packages.
>> You'd need to rebuild them against the new pacemaker packages.
>>
>>> Could someone explain in a few steps how to proceed without losing the
>>> whole cluster configuration up and running?
>>
>> Assuming you have a compatible set of new packages (see above), just
>> do a rolling upgrade.
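As a practical complement to the linked upgrade appendix: backing up the running configuration before the upgrade is cheap insurance, and the document describes re-importing it afterwards. A minimal sketch (file paths are illustrative):

```shell
# Sketch: save the live CIB before upgrading.
cibadmin --query > /root/cib-backup.xml      # raw XML dump of the live CIB
crm configure show > /root/cluster.crm       # human-readable crm-shell form

# After the upgrade, if the configuration needs to be re-imported:
cibadmin --replace --xml-file /root/cib-backup.xml
```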
Re: [Pacemaker] ocf:pacemaker:o2cb Unable to connect to CKPT
On Wed, Aug 25, 2010 at 11:05 AM, Michael Schwartzkopff wrote:
> On Wednesday, 25.08.2010, at 09:43 +0200, Andrew Beekhof wrote:
>> On Fri, Aug 6, 2010 at 3:33 PM, Michael Fung wrote:
>>> Hi All,
>>>
>>> I am still testing with the Debian Squeeze machine.
>>>
>>> Unable to start the RA ocf:pacemaker:o2cb
> (...)
>>
>> No. It just tells corosync to load the extra services like ckpt (part
>> of openais) needed by ocfs2
>
> Hi,
>
> how can I tell corosync to load the ckpt service?

Add a service block, like you do for pacemaker, or use the same option as Michael.
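The service block Andrew refers to is a corosync configuration stanza. Assuming the openais packages are installed, it typically looks like the following (the file location varies; some setups put it in corosync.conf itself, others in a drop-in file):

```text
# /etc/corosync/corosync.conf (or a file under /etc/corosync/service.d/)
# Sketch: tells corosync to load the openais checkpoint (CKPT) service,
# which the ocf:pacemaker:o2cb resource agent needs.
service {
    name: openais_ckpt
    ver:  0
}
```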
Re: [Pacemaker] Howto upgrade Pacemaker cluster from Version: 1.0.2 to the last released on clusterlabs
Hello,
but how do I migrate the entire cluster configuration (resources, nodes, stonith)?

Regards,
Roberto.

On 08/26/2010 09:40 AM, Andrew Beekhof wrote:
> On Wed, Aug 18, 2010 at 11:15 PM, Roberto Giordani
> wrote:
>
>> Hello,
>> I'd like to know how it is possible to upgrade a running pacemaker cluster
>> on Opensuse 11.2 from version 1.0.2 to the latest available on clusterlabs,
>> using dlm + ocfs2 too.
>
> The problem is that the versions of pacemaker on clusterlabs are
> probably incompatible with your existing dlm and ocfs2 packages.
> You'd need to rebuild them against the new pacemaker packages.
>
>> Could someone explain in a few steps how to proceed without losing the
>> whole cluster configuration up and running?
>
> Assuming you have a compatible set of new packages (see above), just
> do a rolling upgrade.
Re: [Pacemaker] how to keep ftp connection when swap from primary to secondary
On Thursday, 26.08.2010, at 17:17 +0200, Raoul Bhatia [IPAX] wrote:
> On 08/26/2010 04:42 PM, liang...@asc-csa.gc.ca wrote:
>> I have followed the guide in "Clusters from Scratch" written by Andrew
>> Beekhof and successfully set up an Active/Passive pair of cluster
>> servers. The cluster runs on Fedora 13 and includes services like
>> apache, vsftpd and nfs. Drbd is used to keep data consistent during a
>> failover. Everything works fine except that ftp loses its connection when
>> the service swaps from the primary to the secondary or vice versa. I know
>> that to keep the ftp connection, one may need to keep the connection states
>> for the session across the nodes. But I couldn't find a clue how to do it.
>> Does anyone have any idea how to keep the ftp connection when
>> swapping nodes, if it is possible?
>
> hi,
>
> as of now, we're not syncing our connections between the load
> balancers, but i would suggest
> http://www.linuxvirtualserver.org/docs/sync.html and the like.
>
> cheers,
> raoul

Even a load balancer wouldn't sync the data that the FTP server on the real servers holds in RAM. You would need a cluster-aware FTP server for that purpose.

On the other hand: how often does a failover happen? Is it really necessary to cater for such rare events?

Michael.
Re: [Pacemaker] cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012
Thanks, who should I contact? Which mailing list?

I've discovered that this problem occurs when the switch port that the
cluster ring is connected to becomes "blocked" due to spanning tree. I've
worked around the bug by putting the ring on a separate switch with
spanning tree disabled and a different subnet.

Is there a configuration that keeps the cluster nodes from hanging while
spanning tree recalculates the topology after a failure? The hang occurs
on SLES11 SP1 too: the servers stay up and running and the cluster status
is ok, but when you connect to a server with ssh, the session hangs after
login. The recalculation usually takes 50 seconds.

Regards,
Roberto.

On 08/26/2010 10:24 AM, Dejan Muhamedagic wrote:
> Hi,
>
> On Thu, Aug 26, 2010 at 09:36:10AM +0200, Andrew Beekhof wrote:
>
>> On Wed, Aug 18, 2010 at 6:24 PM, Roberto Giordani wrote:
>>
>>> Hello,
>>> I'll explain what happened after a network black-out.
>>> I have a cluster with pacemaker on OpenSUSE 11.2 64bit.
>>>
>>> Last updated: Wed Aug 18 18:13:33 2010
>>> Current DC: nodo1 (nodo1)
>>> Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
>>> 3 Nodes configured.
>>> 11 Resources configured.
>>>
>>> Node: nodo1 (nodo1): online
>>> Node: nodo3 (nodo3): online
>>> Node: nodo4 (nodo4): online
>>>
>>> Clone Set: dlm-clone
>>>     dlm:0 (ocf::pacemaker:controld): Started nodo3
>>>     dlm:1 (ocf::pacemaker:controld): Started nodo1
>>>     dlm:2 (ocf::pacemaker:controld): Started nodo4
>>> Clone Set: o2cb-clone
>>>     o2cb:0 (ocf::ocfs2:o2cb): Started nodo3
>>>     o2cb:1 (ocf::ocfs2:o2cb): Started nodo1
>>>     o2cb:2 (ocf::ocfs2:o2cb): Started nodo4
>>> Clone Set: XencfgFS-Clone
>>>     XencfgFS:0 (ocf::heartbeat:Filesystem): Started nodo3
>>>     XencfgFS:1 (ocf::heartbeat:Filesystem): Started nodo1
>>>     XencfgFS:2 (ocf::heartbeat:Filesystem): Started nodo4
>>> Clone Set: XenimageFS-Clone
>>>     XenimageFS:0 (ocf::heartbeat:Filesystem): Started nodo3
>>>     XenimageFS:1 (ocf::heartbeat:Filesystem): Started nodo1
>>>     XenimageFS:2 (ocf::heartbeat:Filesystem): Started nodo4
>>> rsa1-fencing (stonith:external/ibmrsa-telnet): Started nodo4
>>> rsa2-fencing (stonith:external/ibmrsa-telnet): Started nodo3
>>> rsa3-fencing (stonith:external/ibmrsa-telnet): Started nodo4
>>> rsa4-fencing (stonith:external/ibmrsa-telnet): Started nodo3
>>> mailsrv-rm (ocf::heartbeat:Xen): Started nodo3
>>> dbsrv-rm (ocf::heartbeat:Xen): Started nodo4
>>> websrv-rm (ocf::heartbeat:Xen): Started nodo4
>>>
>>> After a switch failure, all the nodes and the RSA stonith devices were
>>> unreachable.
>>>
>>> On the cluster, the following error occurred on one node:
>>>
>>> Aug 18 13:11:38 nodo1 cluster-dlm: receive_plocks_stored:
>>> receive_plocks_stored 1778493632:2 need_plocks 0#012
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272025] [ cut here ]
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at
>>> /usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323!
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272042] invalid opcode: [#1] SMP
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272046] last sysfs file:
>>> /sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272050] CPU 1
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272053] Modules linked in:
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>>> iptable_filter ip_tables x_tables ocfs2 ocfs2_nodemanager quota_tree
>>> ocfs2_stack_user ocfs2_stackglue dlm configfs netbk coretemp blkbk
>>> blkback_pagemap blktap xenbus_be ipmi_si edd dm_round_robin scsi_dh_rdac
>>> dm_multipath scsi_dh bridge stp llc bonding ipv6 fuse ext4 jbd2 crc16 loop
>>> dm_mod sr_mod ide_pci_generic ide_core iTCO_wdt ata_generic ibmpex i5k_amb
>>> ibmaem iTCO_vendor_support ipmi_msghandler bnx2 i5000_edac 8250_pnp shpchp
>>> ata_piix pcspkr ics932s401 joydev edac_core i2c_i801 ses pci_hotplug 8250
>>> i2c_core serio_raw enclosure serial_core button sg reiserfs usbhid hid
>>> uhci_hcd ehci_hcd xenblk cdrom xennet fan processor pata_acpi lpfc thermal
>>> thermal_sys hwmon aacraid [last unloaded: ocfs2_stackglue]
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272111] Pid: 8889, comm: dlm_send
>>> Not tainted 2.6.31.12-0.2-xen #1 IBM System x3650 -[7979AC1]-
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272113] RIP: e030:[]
>>> [] iput+0x82/0x90
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272121] RSP: e02b:88014ec03c30
>>> EFLAGS: 00010246
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272122] RAX: RBX:
>>> 880148a703c8 RCX:
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272123] RDX: c901 RSI:
>>> 880148a70380 RDI: 880148a703c8
>>> Aug 18 13:11:38 nodo1
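[Editorial sketch: instead of a second physical switch, the single-path dependency described above can also be reduced with corosync's redundant ring protocol (RRP). The addresses and mode below are placeholder assumptions, not from the thread; check corosync.conf(5) for your version.]

```
totem {
    version: 2
    # Run two rings over independent networks; "passive" alternates
    # between rings, "active" uses both simultaneously.
    rrp_mode: passive

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.0.0
        mcastaddr: 226.94.1.2
        mcastport: 5405
    }
}
```

Separately, configuring the cluster-facing switch ports as edge ports (RSTP "portfast") typically shortens or eliminates the spanning-tree blocking window.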
Re: [Pacemaker] ocf:pacemaker:o2cb Unable to connect to CKPT
On Wednesday, 25.08.2010 at 09:43 +0200, Andrew Beekhof wrote:
> On Fri, Aug 6, 2010 at 3:33 PM, Michael Fung wrote:
> > Hi All,
> >
> > I am still testing with the Debian Squeeze machine.
> >
> > Unable to start the RA ocf:pacemaker:o2cb (...)
>
> No. It just tells corosync to load the extra services like ckpt (part
> of openais) needed by ocfs2

Hi,

how can I tell corosync to load the ckpt service?

Thanks.
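[Editorial sketch: one commonly documented way is to declare the openais checkpoint service in corosync's configuration, either in corosync.conf itself or in a drop-in file. The file path and service name below should be verified against your distribution's ocfs2/pacemaker guide.]

```
# /etc/corosync/service.d/ckpt  (assumed path; a service{} block in
# corosync.conf works the same way)
service {
    # Load the openais checkpoint (CKPT) service needed by ocfs2_controld
    name: openais_ckpt
    ver: 0
}
```

On SUSE the stack is usually started via the openais init script, which loads the openais services automatically; that may be why the o2cb RA works out of the box there but not when corosync is started directly.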
Re: [Pacemaker] Quorum disk?
On Wednesday, 25.08.2010 at 17:01 -0400, Ciro Iriarte wrote:
> Hi, I'm planning to use OpenAIS+Pacemaker on SLES11-SP1 and would like
> to know if it's possible to use a quorum disk in a two-node cluster.
> The idea is to avoid adding a third node just for quorum...
>
> Regards,

Hi,

you could have a look at the sfex resource agent.

Greetings,

Michael Schwartzkopff
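[Editorial sketch: sfex takes an exclusive lock on a small shared partition rather than adding a quorum vote. A minimal crm-shell configuration might look like the following; the device path, resource names and "my-service" are placeholders, and the parameter names should be verified with `crm ra info ocf:heartbeat:sfex`.]

```
primitive sfex-lock ocf:heartbeat:sfex \
        params device="/dev/disk/by-id/shared-lun-part1" index="1" \
        op monitor interval="10s" timeout="30s"

# Run the protected service only where the lock is held, and only after it.
colocation svc-with-lock inf: my-service sfex-lock
order lock-before-svc inf: sfex-lock my-service
```

Note that this prevents concurrent activation of the protected resources; it does not change the quorum count, so a two-node cluster still typically needs no-quorum-policy=ignore plus working stonith.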
Re: [Pacemaker] drbd diskless -> failover to other node
>> Are you saying that if a server loses its disk, it will transparently
>> write to the secondary server without any need to failover at all?
>
> Yes. As long as it still has a network connection to the peer, of course.
>
>> WOW. I never knew DRBD did this. This is a _fantastic_ feature :)
>
> Well, that's what diskless mode is really all about.
> http://www.drbd.org/users-guide/s-handling-disk-errors.html

A final question: does DRBD switch to Protocol C in diskless mode, or does
it stay with the configured protocol? If it doesn't switch, can it be
configured to?
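[Editorial sketch: the diskless behavior discussed above hinges on the on-io-error handler in the DRBD configuration. A DRBD 8.3-style fragment; the resource name, host names, devices and addresses are placeholders.]

```
resource r0 {
    protocol C;               # replication protocol, fixed per resource

    disk {
        # On a lower-level I/O error, detach the local backing device and
        # continue in diskless mode, servicing all I/O via the peer.
        on-io-error detach;
    }

    on alpha {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.1:7788;
        meta-disk internal;
    }
    on bravo {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7788;
        meta-disk internal;
    }
}
```

To the question above: the users-guide section linked in the thread describes diskless operation but, to my knowledge, no automatic protocol switch; I/O is simply forwarded over the replication link using whatever protocol is configured.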