Re: [Pacemaker] mcast vs broadcast
On Mon, 2010-01-18 at 11:25 -0500, Shravan Mishra wrote:
> Hi all,
>
> Following is my corosync.conf.
>
> Even though broadcast is enabled I see "mcasted" messages like these
> in corosync.log.
>
> Is it ok, even when broadcast is on and not mcast?

Yes, you are using broadcast; the debug output simply doesn't print a
special case for "broadcast" (but it really is broadcasting). This is
debug output meant for developer consumption and is not all that useful
for end users.

> ==
> Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
> Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
> Jan 18 09:50:40 corosync [TOTEM ] Delivering 171 to 173
> Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq 172 to pending delivery queue
> Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq 173 to pending delivery queue
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
> Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
> Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 172
> Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 173
> ==
>
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
>
> totem {
>   version: 2
>   token: 3000
>   token_retransmits_before_loss_const: 10
>   join: 60
>   consensus: 1500
>   vsftype: none
>   max_messages: 20
>   clear_node_high_bit: yes
>   secauth: on
>   threads: 0
>   rrp_mode: passive
>
>   interface {
>     ringnumber: 0
>     bindnetaddr: 192.168.2.0
>     # mcastaddr: 226.94.1.1
>     broadcast: yes
>     mcastport: 5405
>   }
>   interface {
>     ringnumber: 1
>     bindnetaddr: 172.20.20.0
>     # mcastaddr: 226.94.2.1
>     broadcast: yes
>     mcastport: 5405
>   }
> }
>
> logging {
>   fileline: off
>   to_stderr: yes
>   to_logfile: yes
>   to_syslog: yes
>   logfile: /tmp/corosync.log
>   debug: on
>   timestamp: on
>   logger_subsys {
>     subsys: AMF
>     debug: off
>   }
> }
>
> service {
>   name: pacemaker
>   ver: 0
> }
>
> aisexec {
>   user: root
>   group: root
> }
>
> amf {
>   mode: disabled
> }
>
> Thanks
> Shravan

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
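For reference, broadcast and multicast transports differ only in the interface stanza; a minimal sketch, using the example values from this thread (when `broadcast: yes` is set, `mcastaddr` must be left out — but the TOTEM log subsystem still labels every send as "mcasted"):

```text
interface {
    ringnumber: 0
    bindnetaddr: 192.168.2.0
    broadcast: yes          # broadcast transport: no mcastaddr here
    mcastport: 5405
}

# Multicast variant of the same ring (mutually exclusive with broadcast):
# interface {
#     ringnumber: 0
#     bindnetaddr: 192.168.2.0
#     mcastaddr: 226.94.1.1
#     mcastport: 5405
# }
```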
Re: [Pacemaker] errors in corosync.log
One possibility is you have a different cluster in your network on the
same multicast address and port.

Regards
-steve

On Sat, 2010-01-16 at 15:20 -0500, Shravan Mishra wrote:
> Hi Guys,
>
> I'm running the following versions of pacemaker and corosync:
> corosync=1.1.1-1-2
> pacemaker=1.0.9-2-1
>
> Everything had been running fine for quite some time, but then I
> started seeing the following errors in the corosync logs:
>
> =
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid digest... ignoring.
> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
> =
>
> I can perform all the crm shell commands and whatnot, but it's
> troubling that the above is happening. My crm_mon output looks good.
>
> I also checked the authkey and ran md5sum on both nodes; it's the same.
> Then I stopped corosync, regenerated the authkey with corosync-keygen
> and copied it to the other machine, but I still get the above message
> in the corosync log.
>
> Is there anything other than the authkey that I should look into?
>
> corosync.conf:
>
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
>
> totem {
>   version: 2
>   token: 3000
>   token_retransmits_before_loss_const: 10
>   join: 60
>   consensus: 1500
>   vsftype: none
>   max_messages: 20
>   clear_node_high_bit: yes
>   secauth: on
>   threads: 0
>   rrp_mode: passive
>
>   interface {
>     ringnumber: 0
>     bindnetaddr: 192.168.2.0
>     # mcastaddr: 226.94.1.1
>     broadcast: yes
>     mcastport: 5405
>   }
>   interface {
>     ringnumber: 1
>     bindnetaddr: 172.20.20.0
>     # mcastaddr: 226.94.1.1
>     broadcast: yes
>     mcastport: 5405
>   }
> }
>
> logging {
>   fileline: off
>   to_stderr: yes
>   to_logfile: yes
>   to_syslog: yes
>   logfile: /tmp/corosync.log
>   debug: off
>   timestamp: on
>   logger_subsys {
>     subsys: AMF
>     debug: off
>   }
> }
>
> service {
>   name: pacemaker
>   ver: 0
> }
>
> aisexec {
>   user: root
>   group: root
> }
>
> amf {
>   mode: disabled
> }
>
> Thanks
> Shravan
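The "invalid digest" error means incoming TOTEM packets fail authentication against the local /etc/corosync/authkey, so comparing key digests across nodes (as the poster did) is the right first check. A self-contained sketch of that comparison logic — the /tmp paths and key contents below are stand-ins for copies of /etc/corosync/authkey taken from each node:

```shell
# Stand-ins for /etc/corosync/authkey as copied from each node.
printf 'example-authkey-bytes' > /tmp/authkey.node1
printf 'example-authkey-bytes' > /tmp/authkey.node2

# Same digest on every node means the keys are in sync; a mismatch
# would explain "Received message has invalid digest... ignoring."
a=$(md5sum < /tmp/authkey.node1)
b=$(md5sum < /tmp/authkey.node2)
if [ "$a" = "$b" ]; then
    echo "authkey digests match"
else
    echo "authkey digests differ"
fi
```

Note that matching keys do not rule out the cause Steve suggests: a foreign cluster on the same address and port will also produce packets whose digest cannot be verified.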
Re: [Pacemaker] Split Site 2-way clusters
Florian (and all), thanks for the reply.

I've gone over past threads on the DRBD list as you suggested, and found
only this:
http://archives.free.net.ph/message/20090909.131635.ef640f6a.en.html

I am not entirely certain what specific problem the
one-separate-cluster-at-each-site design addresses that
one-node-on-each-site does not. From the above thread, the only
roadblock explicitly mentioned was setting up cross-site multicast
routing, which needs to be made to work. Fair enough. I'd like to get a
clear idea of what the roadblocks --actually are-- (not on a "The WAN
link" level, but what the WAN link -actually breaks-) to doing what I
suggested. Assuming I can get it to work, are there any other specific
reasons it wouldn't?

To recap, in my proposed solution, an outage will result in four things:

1. A "race" by both nodes to a 3rd site, to perform an atomic operation
   (a mkdir for instance). Following it, it will be abundantly clear to
   both nodes "who is right, and who is dead".

2. A hard-iLO-poweroff STONITH (NOT reboot!) from the winner to the
   loser's iLO. It can also iptables-block all comms from the loser
   until further notice as an extra safety-net.

3. A hard-own-iLO-poweroff-else-kernel-halt SMITH (NOT reboot!) suicide
   by the loser (SMITH is our pet acronym for Shoot-Myself-...).

4. A WAN-PROBLEM=[true|false] flag immediately raised (locally) by the
   winner, based on pinging the OTHER SITE's ROUTER. A separate resource
   on the winner will, in the presence of this flag, monitor the same
   router of the other site for life, and when the other site comes back
   up (perhaps -and-stays-up-for-an-hour- or some similar flap-avoiding
   logic) issues a POWERON to the other node's iLO, which will come back
   up as a DRBD slave, resync, and get re-promoted to master.

As an attractive side-benefit, this is a deathmatch-proof design.

NOTE: There's a departure from common wisdom here, and I am not sure
whether this is one of the issues you're pointing at.

Common wisdom states: SMITH BAD, not reliable (obvious reasons - no
success/failure feedback etc).

In this solution I claim: SMITH BAD, not reliable, except in one
specific failure mode (WAN outage) where SMITH GOOD, is reliable, and
the shortcomings can be worked around.

Both steps [2] and [3] are issued on EVERY TYPE of outage, regardless of
whether it's WAN-related or not. In non-WAN issues the loser is
considered compromised, thus making [3] unreliable, but [2] is reliable.
In WAN issues, the WAN is considered compromised, thus making [2]
unreliable, but the node itself is sound, so [3] still is reliable.

To sum up, it looks to me like the "data safety" is provided by the
layer underneath DRBD, not DRBD itself, and if it works as advertised,
DRBD should have no problem; thus we have a system sufficiently reliable
to withstand any scenario short of a double failure.

... thoughts?

-----Original Message-----
From: Florian Haas [mailto:florian.h...@linbit.com]
Sent: Monday, 18 January 2010 9:36 PM
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Split Site 2-way clusters

On 2010-01-18 11:14, Andrew Beekhof wrote:
> On Thu, Jan 14, 2010 at 11:44 PM, Miki Shapiro wrote:
>> Confused.
>>
>> I *am* running DRBD in dual-master mode
>
> /me cringes... this sounds to me like an impossibly dangerous idea.
> Can someone from linbit comment on this please? Am I imagining this?

Dual-Primary DRBD in a split site cluster? Really really bad idea.

Anyone attempting this, please search the drbd-user archives for
multiple discussions about this in the past. Then reconsider.

Hope that makes it clear enough.
Florian
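The step-1 "race" above relies on mkdir being atomic: of two contenders, exactly one creates the directory. A minimal local sketch of that tie-breaker — the /tmp path is a stand-in for a directory on the third site, and in the real design the two calls would come from the two cluster nodes:

```shell
LOCK=/tmp/quorum.lock.demo
rm -rf "$LOCK"   # clean slate for the demo only

race() {
    # mkdir either creates the directory (this node wins) or fails
    # because the other node already created it (this node loses);
    # there is no in-between state.
    if mkdir "$LOCK" 2>/dev/null; then
        echo "node $1 wins: STONITH the peer, keep serving"
    else
        echo "node $1 loses: self-fence (SMITH)"
    fi
}

race A   # first to arrive gets the directory
race B   # second finds it already exists
```

Running the sketch prints a win for A and a loss for B; the single point this illustrates is that both nodes can reach the same verdict without talking to each other, only to the third site.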
[Pacemaker] 1.0.7 upgraded, restarting resources problem
Hi, i have one m/s drbd resource and one Xen instance on top. Both m/s are primary. When i restart node that's _not_ hosting the Xen instance (ibm1), pacemaker restarts running Xen instance on the other node (ibm2). There is no need to do that. I thought it got fixed (http://developerbugs.linux-foundation.org/show_bug.cgi?id=2153). Didn't it? Here is my config once more. Please note the WARNING showed up only after upgrade. (BTW setting drbd0predHosting score to 0 doesn't restart it. But it doesn't help resource ordering either.) [r...@ibm1 etc]# crm configure show WARNING: notify: operation name not recognized node $id="3d430f49-b915-4d52-a32b-b0799fa17ae7" ibm2 node $id="4b2047c8-f3a0-4935-84a2-967b548598c9" ibm1 primitive Hosting ocf:heartbeat:Xen \ params xmfile="/etc/xen/Hosting.cfg" shutdown_timeout="303" \ meta target-role="Started" allow-migrate="true" is-managed="true" \ op monitor interval="120s" timeout="506s" start-delay="5s" \ op migrate_to interval="0s" timeout="304s" \ op migrate_from interval="0s" timeout="304s" \ op stop interval="0s" timeout="304s" \ op start interval="0s" timeout="202s" primitive drbd_r0 ocf:linbit:drbd \ params drbd_resource="r0" \ op monitor interval="15s" role="Master" timeout="30s" \ op monitor interval="30s" role="Slave" timeout="30s" \ op stop interval="0s" timeout="501s" \ op notify interval="0s" timeout="90s" \ op demote interval="0s" timeout="90s" \ op promote interval="0s" timeout="90s" \ op start interval="0s" timeout="255s" ms ms_drbd_r0 drbd_r0 \ meta notify="true" master-max="2" inteleave="true" is-managed="true" target-role="Started" order drbd0predHosting inf: ms_drbd_r0:promote Hosting:start property $id="cib-bootstrap-options" \ dc-version="1.0.7-b1191b11d4b56dcae8f34715d52532561b875cd5" \ cluster-infrastructure="Heartbeat" \ stonith-enabled="false" \ no-quorum-policy="ignore" \ default-resource-stickiness="10" \ last-lrm-refresh="1263845352" All i want is to have just one resource Hosting started, after drbd 
was promoted(/primary) on the node that's it's starting. Please advise me if you can. Thank you, regards, M. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
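One thing worth checking in the configuration above: the ms meta attribute is spelled "inteleave", so interleave was never actually enabled, and with interleave off, ordering against a clone/ms resource is cluster-wide rather than per-node — which can make a restart on one node ripple to the other. A hedged sketch of the relevant part (resource names as in the post; the colocation name is hypothetical, and whether this cures the needless restart would need testing):

```text
ms ms_drbd_r0 drbd_r0 \
    meta notify="true" master-max="2" interleave="true"
order drbd0predHosting inf: ms_drbd_r0:promote Hosting:start
colocation HostingWithMaster inf: Hosting ms_drbd_r0:Master
```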
Re: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released
On Mon, Jan 18, 2010 at 1:29 PM, Andrew Beekhof wrote: > On Mon, Jan 18, 2010 at 1:17 PM, Andreas Mock wrote: >>> -Ursprüngliche Nachricht- >>> Von: "Andrew Beekhof" >>> Gesendet: 18.01.10 12:43:30 >>> An: The Pacemaker cluster resource manager >>> Betreff: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released >> >> >>> The latest installment of the Pacemaker 1.0 stable series is now ready for >>> general consumption. >> >> Great. >> >>> Pre-built packages for Pacemaker and it s immediate dependancies are >>> currently building and will be available for openSUSE, SLES, Fedora, RHEL, >>> CentOS from the ClusterLabs Build Area (http://www.clusterlabs.org/rpm) >>> shortly. >> >> Please don't forget openSuSE 10.2. I'm waiting... ;-) > > I've not forgotten. > Actually it was the first one i tried but there seems to be some issues there. > Done. Please let me know how it goes. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
[Pacemaker] mcast vs broadcast
Hi all,

Following is my corosync.conf.

Even though broadcast is enabled I see "mcasted" messages like these in
corosync.log.

Is it ok, even when broadcast is on and not mcast?

==
Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
Jan 18 09:50:40 corosync [TOTEM ] Delivering 171 to 173
Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq 172 to pending delivery queue
Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq 173 to pending delivery queue
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 172
Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 173
==

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
  version: 2
  token: 3000
  token_retransmits_before_loss_const: 10
  join: 60
  consensus: 1500
  vsftype: none
  max_messages: 20
  clear_node_high_bit: yes
  secauth: on
  threads: 0
  rrp_mode: passive

  interface {
    ringnumber: 0
    bindnetaddr: 192.168.2.0
    # mcastaddr: 226.94.1.1
    broadcast: yes
    mcastport: 5405
  }
  interface {
    ringnumber: 1
    bindnetaddr: 172.20.20.0
    # mcastaddr: 226.94.2.1
    broadcast: yes
    mcastport: 5405
  }
}

logging {
  fileline: off
  to_stderr: yes
  to_logfile: yes
  to_syslog: yes
  logfile: /tmp/corosync.log
  debug: on
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: off
  }
}

service {
  name: pacemaker
  ver: 0
}

aisexec {
  user: root
  group: root
}

amf {
  mode: disabled
}

Thanks
Shravan
Re: [Pacemaker] errors in corosync.log
Hi,

I'm seeing the following messages in corosync.log:

=
Jan 18 09:50:41 corosync [pcmk ] ERROR: check_message_sanity: Message payload is corrupted: expected 1929 bytes, got 669
Jan 18 09:50:41 corosync [pcmk ] ERROR: check_message_sanity: Child 28857 spawned to record non-fatal assertion failure line 1286: sane
Jan 18 09:50:41 corosync [pcmk ] ERROR: check_message_sanity: Invalid message 70: (dest=local:cib, from=node1.itactics.com:cib.22575, compressed=0, size=1929, total=2521)
..
=

I'm not entirely sure what's causing them.

Thanks
Shravan

On Mon, Jan 18, 2010 at 9:03 AM, Shravan Mishra wrote:
> Hi,
>
> Since the interfaces on the two nodes are connected via crossover
> cable, there is no chance of that happening; and since I'm using
> rrp_mode: passive, the other ring, i.e. ring 1, will come into play
> only when ring 0 fails, I assume. I say this because the ring 1
> interface is on the network.
>
> One interesting thing that I observed was that libtomcrypt is being
> used for crypto reasons because I have secauth: on. But I couldn't
> find that library on my machine. I'm wondering if it's because of
> that.
>
> Basically we are using 3 interfaces: eth0, eth1 and eth2.
> eth0 and eth2 are for ring 0 and ring 1 respectively; eth1 is the
> primary interface.
> > This is what my drbd.conf looks like: > > > == > # please have a a look at the example configuration file in > # /usr/share/doc/drbd82/drbd.conf > # > global { > usage-count no; > } > common { > protocol C; > startup { > wfc-timeout 120; > degr-wfc-timeout 120; > } > } > resource var_nsm { > syncer { > rate 333M; > } > handlers { > fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; > after-resync-target > "/usr/lib/drbd/crm-unfence-peer.sh"; > } > net { > after-sb-1pri discard-secondary; > } > on node1.itactics.com { > device /dev/drbd1; > disk /dev/sdb3; > address 172.20.20.1:7791; > meta-disk internal; > } > on node2.itactics.com { > device /dev/drbd1; > disk /dev/sdb3; > address 172.20.20.2:7791; > meta-disk internal; > } > } > = > > > eth0's of the two nodes are connected via cross over as I mentioned > and eth1 and eth2 are on the network. > > I'm not a networking expert but is it possible that broadcast done by > ,let's say, any node not in my cluster, will still cause it to come to > my nodes through other interfaces which are attached to the network? > > > We in the dev and the QA guys are testing this in parallel. > > And let's say there is QA cluster of two nodes and dev cluster of 2 nodes. > > And interfaces for both of them are hooked as I mentioned above and that > corosync.conf for both the clusters have "bindnetaddr: 192.168.2.0". > > Is there possibility of bad messages for the cluster casused by the other. > > > We are in the final leg of the testing and this came up. > > Thanks for the help. 
> > > Shravan > > > > > > > On Mon, Jan 18, 2010 at 2:58 AM, Andrew Beekhof wrote: >> On Sat, Jan 16, 2010 at 9:20 PM, Shravan Mishra >> wrote: >>> Hi Guys, >>> >>> I'm running the following version of pacemaker and corosync >>> corosync=1.1.1-1-2 >>> pacemaker=1.0.9-2-1 >>> >>> Every thing had been running fine for quite some time now but then I >>> started seeing following errors in the corosync logs, >>> >>> >>> = >>> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid >>> digest... ignoring. >>> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data >>> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid >>> digest... ignoring. >>> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data >>> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid >>> digest... ignoring. >>> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data >>> >>> >>> I can perform all the crm shell commands and what not but it's >>> troubling that the above is happening. >>> >>> My crm_mon output looks good. >>> >>> >>> I also checked the authkey and did md5sum on both it's same. >>> >>> Then I stopped corosync and regenerated the authkey with >>> corosync-keygen and copied it to the the other machine but I still get >>> the above message in the corosync log. >> >> Are you sure there's not a third node somewhere broadcasting on that >> mcast and port combination? >> >>> >>> Is there anything other authkey that I should look into ? >>> >>> >>> corosync.conf >>> >>> >>> >>> # Please read the corosync.conf.5 manual page >>> compatibility: whitetank >>> >>> totem { >>> version: 2 >>> token: 3000 >>> token_retransmits_before_loss_const: 10 >>> join: 60 >>> consensus: 1500 >>> vsftype: none >>> m
Re: [Pacemaker] errors in corosync.log
Hi,

Since the interfaces on the two nodes are connected via crossover
cable, there is no chance of that happening; and since I'm using
rrp_mode: passive, the other ring, i.e. ring 1, will come into play
only when ring 0 fails, I assume. I say this because the ring 1
interface is on the network.

One interesting thing that I observed was that libtomcrypt is being
used for crypto reasons because I have secauth: on. But I couldn't find
that library on my machine. I'm wondering if it's because of that.

Basically we are using 3 interfaces: eth0, eth1 and eth2.
eth0 and eth2 are for ring 0 and ring 1 respectively; eth1 is the
primary interface.

This is what my drbd.conf looks like:

==
# please have a look at the example configuration file in
# /usr/share/doc/drbd82/drbd.conf
global {
    usage-count no;
}
common {
    protocol C;
    startup {
        wfc-timeout 120;
        degr-wfc-timeout 120;
    }
}
resource var_nsm {
    syncer {
        rate 333M;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    net {
        after-sb-1pri discard-secondary;
    }
    on node1.itactics.com {
        device /dev/drbd1;
        disk /dev/sdb3;
        address 172.20.20.1:7791;
        meta-disk internal;
    }
    on node2.itactics.com {
        device /dev/drbd1;
        disk /dev/sdb3;
        address 172.20.20.2:7791;
        meta-disk internal;
    }
}
=

eth0's of the two nodes are connected via crossover as I mentioned, and
eth1 and eth2 are on the network.

I'm not a networking expert, but is it possible that a broadcast done
by, let's say, any node not in my cluster will still reach my nodes
through the other interfaces which are attached to the network?

The dev and QA teams are testing this in parallel. Let's say there is a
QA cluster of two nodes and a dev cluster of two nodes, their
interfaces are hooked up as I mentioned above, and corosync.conf for
both clusters has "bindnetaddr: 192.168.2.0". Is there a possibility of
bad messages for one cluster caused by the other?
We are in the final leg of the testing and this came up. Thanks for the help. Shravan On Mon, Jan 18, 2010 at 2:58 AM, Andrew Beekhof wrote: > On Sat, Jan 16, 2010 at 9:20 PM, Shravan Mishra > wrote: >> Hi Guys, >> >> I'm running the following version of pacemaker and corosync >> corosync=1.1.1-1-2 >> pacemaker=1.0.9-2-1 >> >> Every thing had been running fine for quite some time now but then I >> started seeing following errors in the corosync logs, >> >> >> = >> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid >> digest... ignoring. >> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data >> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid >> digest... ignoring. >> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data >> Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid >> digest... ignoring. >> Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data >> >> >> I can perform all the crm shell commands and what not but it's >> troubling that the above is happening. >> >> My crm_mon output looks good. >> >> >> I also checked the authkey and did md5sum on both it's same. >> >> Then I stopped corosync and regenerated the authkey with >> corosync-keygen and copied it to the the other machine but I still get >> the above message in the corosync log. > > Are you sure there's not a third node somewhere broadcasting on that > mcast and port combination? > >> >> Is there anything other authkey that I should look into ? 
>> >> >> corosync.conf >> >> >> >> # Please read the corosync.conf.5 manual page >> compatibility: whitetank >> >> totem { >> version: 2 >> token: 3000 >> token_retransmits_before_loss_const: 10 >> join: 60 >> consensus: 1500 >> vsftype: none >> max_messages: 20 >> clear_node_high_bit: yes >> secauth: on >> threads: 0 >> rrp_mode: passive >> >> interface { >> ringnumber: 0 >> bindnetaddr: 192.168.2.0 >> #mcastaddr: 226.94.1.1 >> broadcast: yes >> mcastport: 5405 >> } >> interface { >> ringnumber: 1 >> bindnetaddr: 172.20.20.0 >> #mcastaddr: 226.94.1.1 >> broadcast: yes >> mcastport: 5405 >> } >> } >> >> >> logging { >> fileline: off >> to_stderr: yes >> to_logfile: yes >> to_syslog: yes >> logfile: /tmp/corosync.log >> debug: off >> timestamp: on >> logger_subsys { >> subsys: AMF >>
Re: [Pacemaker] Split Site 2-way clusters
On Mon, Jan 18, 2010 at 11:14:58AM +0100, Andrew Beekhof wrote:
> > NodeX (successfully) taking on data from clients while in
> > quorumless-freeze-still-providing-service, then discarding its
> > hitherto collected client data when realizing the other node has
> > quorum, isn't good.
>
> Agreed - freeze isn't an option if you're doing master/master.

no-quorum=freeze alone is not sufficient when doing master/slave,
either: if the current master risks being blown away later, you lose
all changes from replication link loss to being shot.

So you have to make sure there will be no changes between those two
events; you need to also freeze IO on the DRBD Primary.

The fence-peer handler script hook and the DRBD fencing policy are what
can be used for this.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
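The freeze-IO-on-the-Primary behaviour Lars describes maps onto DRBD configuration roughly as follows — a sketch only: the resource name r0 is a placeholder, while the handler paths are the ones DRBD's Pacemaker integration ships (they also appear in the drbd.conf quoted earlier in this digest):

```text
resource r0 {
  disk {
    fencing resource-and-stonith;   # suspend I/O until fence-peer returns
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

With `resource-and-stonith`, DRBD freezes writes on the Primary while the fence-peer handler places a constraint (or the cluster shoots the peer), which is exactly the "no changes between link loss and being shot" guarantee the post calls for.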
Re: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released
On Mon, Jan 18, 2010 at 1:17 PM, Andreas Mock wrote: >> -Ursprüngliche Nachricht- >> Von: "Andrew Beekhof" >> Gesendet: 18.01.10 12:43:30 >> An: The Pacemaker cluster resource manager >> Betreff: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released > > >> The latest installment of the Pacemaker 1.0 stable series is now ready for >> general consumption. > > Great. > >> Pre-built packages for Pacemaker and it s immediate dependancies are >> currently building and will be available for openSUSE, SLES, Fedora, RHEL, >> CentOS from the ClusterLabs Build Area (http://www.clusterlabs.org/rpm) >> shortly. > > Please don't forget openSuSE 10.2. I'm waiting... ;-) I've not forgotten. Actually it was the first one i tried but there seems to be some issues there. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released
> -Ursprüngliche Nachricht- > Von: "Andrew Beekhof" > Gesendet: 18.01.10 12:43:30 > An: The Pacemaker cluster resource manager > Betreff: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released > The latest installment of the Pacemaker 1.0 stable series is now ready for > general consumption. Great. > Pre-built packages for Pacemaker and its immediate dependancies are > currently building and will be available for openSUSE, SLES, Fedora, RHEL, > CentOS from the ClusterLabs Build Area (http://www.clusterlabs.org/rpm) > shortly. Please don't forget openSuSE 10.2. I'm waiting... ;-) Best regards + Thanks Andreas ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Split Site 2-way clusters
On Mon, Jan 18, 2010 at 11:52 AM, Florian Haas wrote: > > the current approach is to utilize 2 Pacemaker clusters, each highly > available in its own right, and employing manual failover. As described > here: Thanks for the pointer! Perhaps "site" is not quite the correct term for our setup, where we still have (multiple) Gbit-or-faster ethernet links, think fire areas, at most in adjacent buildings. For the next step up, two geographically different sites, I agree that manual failover is more appropriate, but we feel that our case of the fire areas should still be handled automatically…(?) Can anybody judge how difficult it would be to integrate some kind of quorum-support into the cluster? (All cluster nodes attempt a quorum reservation; the node that gets it, has 1.5 or 2 votes towards the quorum, rather than just one; this would ensure continued operation in the case of a) a fire area losing power, b) the separate quorum-server failing, or c) the cross-fire-area cluster-interconnects failing (but not more than one failure at a time)…) Regards, Colin ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
[Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released
The latest installment of the Pacemaker 1.0 stable series is now ready
for general consumption.

In this release, we've made a number of improvements to clone handling
- particularly the way ordering constraints are processed - as well as
some really nice improvements to the shell.

The next 1.0 release is anticipated to be in mid-March. We will be
switching to a bi-monthly release schedule to begin focusing on
development for the next stable series (more details soon). So, if you
have feature requests, now is the time to voice them and/or provide
patches :-)

Pre-built packages for Pacemaker and its immediate dependencies are
currently building and will be available for openSUSE, SLES, Fedora,
RHEL, and CentOS from the ClusterLabs Build Area
(http://www.clusterlabs.org/rpm) shortly.

Read the full announcement at:
http://theclusterguy.clusterlabs.org/post/340780359/pacemaker-1-0-7-released

General installation instructions are available from the ClusterLabs
wiki: http://clusterlabs.org/wiki/Install

-- Andrew
Re: [Pacemaker] Pre-Announce: End of 0.6 support is near
On 2010-01-18 12:09, Andrew Beekhof wrote:
> On Mon, Jan 18, 2010 at 11:57 AM, Florian Haas wrote:
>> On 2010-01-18 11:18, Andrew Beekhof wrote:
>>> Biggest caveat is the networking issue that makes pacemaker 1.0
>>> wire-incompatible with pacemaker 0.6 (and heartbeat 2.1.x).
>>> So rolling upgrades are out and you'd need to look at one of the
>>> other upgrade strategies.
>> Even though I've bugged you about this repeatedly in the past, I'll
>> reiterate that I think this non-support of rolling upgrades is a bad
>> thing(tm).
>
> Its not something that was done intentionally, and we have tests in
> place to ensure it doesn't happen again.
> But given that to-date about 4 people have noticed it didn't work (and
> my employer has no interest in older versions, especially when they're
> running heartbeat), I have no current inclination to spend time on the
> problem myself.
>
> That doesn't prevent the vocal minority that maintain it's a huge
> issue affecting half the globe from fixing the problem instead of
> being pests. If you spent half as much time looking into the problem
> as moaning about it, it would probably be done by now.

Calm down. I thought one smiley face was enough to mark the post as at
least partially ironic.

Suggested course of action: remove this part:

"This method is currently broken between Pacemaker 0.6.x and 1.0.x.
Measures have been put into place to ensure rolling upgrades always
work for versions after 1.0.0. If there is sufficient demand, the work
to repair 0.6 -> 1.0 compatibility will be carried out. Otherwise,
please try one of the other upgrade strategies. Detach/Reattach is a
particularly good option for most people."

from the "rolling upgrades" section in the docs, and declare that you
will only ever guarantee to support rolling upgrades within the same
minor release, and between adjacent minor releases when the major
release number got bumped.
Then: * Rolling upgrades would always be supported between 1.n.x and 1.n.y for any value of n, x and y; * Rolling upgrades would be always supported between 1.n.x and 1.n+1.0, where x is the final bugfix release of the 1.n series; * Any other upgrade paths would only be supported on a best-effort basis, with detach/reattach as a readily available fallback option. Just my two cents. Florian signature.asc Description: OpenPGP digital signature ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] [Linux-HA] Announce: Hawk (HA Web Konsole)
I look forward to taking this for a spin! Do we have a bugzilla component for it yet? On Sat, Jan 16, 2010 at 2:14 PM, Tim Serong wrote: > Greetings All, > > This is to announce the development of the Hawk project, > a web-based GUI for Pacemaker HA clusters. > > So, why another management tool, given that we already have > the crm shell, the Python GUI, and DRBD MC? In order: > > 1) We have the usual rationale for a GUI over (or in addition > to) a CLI tool; it is (or should be) easier to use, for > a wider audience. > > 2) The Python GUI is not always easily installable/runnable > (think: sysadmins with Windows desktops and/or people who > don't want to, or can't, forward X). > > 3) Believe it or not, there are a number of cases where, > citing security reasons, site policy prohibits ssh access > to servers (which is what DRBD MC uses internally). > > There are also some differing goals; Hawk is not intended > to expose absolutely everything. There will be point somewhere > where you have to say "and now you must learn to use a shell". > > Likewise, Hawk is not intended to install the base cluster > stack for you (whereas DRBD MC does a good job of this). > > It's early days yet (no downloadable packages), but you can > get the current source as follows: > > # hg clone http://hg.clusterlabs.org/pacemaker/hawk > # cd hawk > # hg update tip > > This will give you a web-based GUI with a display roughly > analagous to crm_mon, in terms of status of cluster resources. > It will show you running/dead/standby nodes, and the resources > (clones, groups & primitives) running on those nodes. > > It does not yet provide information about failed resources or > nodes, other than the fact that they are not running. > > Display of nodes & resources is collapsible (collapsed by > default), but if something breaks while you are looking at it, > the display will expand to show the broken nodes and/or > resources. > > Hawk is intended to run on each node in your cluster. 
You > can then access it by pointing your web browser at the IP > address of any cluster node, or the address of any IPaddr(2) > resource you may have configured. > > Minimally, to see it in action, you will need the following > packages and their dependencies (names per openSUSE/SLES): > > - ruby > - rubygem-rails-2_3 > - rubygem-gettext_rails > > Once you've got those installed, run the following command: > > # hawk/script/server > > Then, point your browser at http://your-server:3000/ to see > the status of your cluster. > > Ultimately, hawk is intended to be installed and run as a > regular system service via /etc/init.d/hawk. To do this, > you will need the following additional packages: > > - lighttpd > - lighttpd-mod_magnet > - ruby-fcgi > - rubygem-rake > > Then, try the following, but READ THE MAKEFILE FIRST! > "make install" (and the rest of the build system for that > matter) is frightfully primitive at the moment: > > # make > # sudo make install > # /etc/init.d/hawk start > > Then, point your browser at http://your-server:/ to see > the status of your cluster. > > Assuming you've read this far, what next? > > - In the very near future (but probably not next week, > because I'll be busy at linux.conf.au) you can expect to > see further documentation and roadmap info up on the > clusterlabs.org wiki. > > - Immediate goal is to obtain feature parity with crm_mon > (completing status display, adding error/failure messages). > > - Various pieces of scaffolding need to be put in place (login > page, access via HTTPS, clean up build/packaging, theming, > etc.) > > - After status display, the following major areas of > functionality are: > - Basic operator tasks (stop/start/migrate resource, > standby/online node, etc.) > - Explore failure scenarios (shadow CIB magic to see > what would happen if a node/resource failed). > - Ability to actually configure resources and nodes. > > Please direct comments, feedback, questions, etc.
to > tser...@novell.com and/or the Pacemaker mailing list. > > Thank you for your attention. > > Regards, > > Tim > > > -- > Tim Serong > Senior Clustering Engineer, Novell Inc. > > > ___ > Linux-HA mailing list > linux...@lists.linux-ha.org > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems > ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Pre-Announce: End of 0.6 support is near
On Mon, Jan 18, 2010 at 11:57 AM, Florian Haas wrote: > On 2010-01-18 11:18, Andrew Beekhof wrote: >> Biggest caveat is the networking issue that makes pacemaker 1.0 >> wire-incompatible with pacemaker 0.6 (and heartbeat 2.1.x). >> So rolling upgrades are out and you'd need to look at one of the other >> upgrade strategies. > > Even though I've bugged you about this repeatedly in the past, I'll > reiterate that I think this non-support of rolling upgrades is a bad > thing(tm). It's not something that was done intentionally, and we have tests in place to ensure it doesn't happen again. But given that to date about 4 people have noticed it didn't work (and my employer has no interest in older versions, especially when they're running heartbeat), I have no current inclination to spend time on the problem myself. That doesn't prevent the vocal minority that maintains it's a huge issue affecting half the globe from fixing the problem instead of being pests. If you spent half as much time looking into the problem as moaning about it, it would probably be done by now. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Pre-Announce: End of 0.6 support is near
On 2010-01-18 11:18, Andrew Beekhof wrote: > Biggest caveat is the networking issue that makes pacemaker 1.0 > wire-incompatible with pacemaker 0.6 (and heartbeat 2.1.x). > So rolling upgrades are out and you'd need to look at one of the other > upgrade strategies. Even though I've bugged you about this repeatedly in the past, I'll reiterate that I think this non-support of rolling upgrades is a bad thing(tm). Just so someone puts this on the record. :) Cheers, Florian ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Split Site 2-way clusters
On 2010-01-18 11:41, Colin wrote: > Hi All, > > we are currently looking at nearly the same issue, in fact I just > wanted to start a similarly titled thread when I stumbled over these > messages… > > The setup we are evaluating is actually a 2*N-node-cluster, i.e. two > slightly separated sites with N nodes each. The main difference to an > N-node-cluster is that a failure of one of the two groups of nodes > must be considered a single failure event [against which the cluster > must protect, e.g. loss of power at one site]. Colin, the current approach is to utilize 2 Pacemaker clusters, each highly available in its own right, and to employ manual failover. As described here: http://www.drbd.org/users-guide/s-pacemaker-floating-peers.html#s-pacemaker-floating-peers-site-fail-over May be combined with DRBD resource stacking, obviously. Given the fact that most organizations currently employ a non-automatic policy for site failover (as in, "must be authorized by J. Random Vice President"), this is a sane approach that works for most. Automatic failover is a different matter, not just with regard to clustering (where neither Corosync nor Pacemaker nor Heartbeat currently support any concept of "sites"), but also in terms of IP address failover, dynamic routing, etc. Cheers, Florian ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Split Site 2-way clusters
Hi All, we are currently looking at nearly the same issue, in fact I just wanted to start a similarly titled thread when I stumbled over these messages… The setup we are evaluating is actually a 2*N-node-cluster, i.e. two slightly separated sites with N nodes each. The main difference to an N-node-cluster is that a failure of one of the two groups of nodes must be considered a single failure event [against which the cluster must protect, e.g. loss of power at one site]. As far as I gather from this, and other, mail threads, there is currently no out-of-the-box quorum-something solution for pacemaker. Before I start digging deeper [into possible solutions], there's one question I need to ask: In a pacemaker + corosync setup, who decides whether a partition has quorum? I.e., would a quorum-device mechanism need to be integrated with corosync, or with pacemaker, or with both? Thanks, Colin ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
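To make the arithmetic behind Colin's concern concrete, here is a toy sketch of the standard majority-vote quorum rule (an illustration only, not corosync's or Pacemaker's actual implementation). It shows why, in a symmetric 2*N split-site cluster, neither half can claim quorum after losing the other site, and why a single tie-breaker vote at a third location breaks the symmetry:

```python
def has_quorum(votes: int, total_votes: int) -> bool:
    """Majority rule: a partition has quorum only if it holds
    strictly more than half of the expected votes."""
    return 2 * votes > total_votes

# A 2*N-node cluster, N nodes per site (here N = 3, 6 votes total).
total = 6
surviving_site = 3  # nodes left after the other site loses power

print(has_quorum(surviving_site, total))      # 3 of 6: no quorum on either side
print(has_quorum(surviving_site + 1, total))  # 4 of 6: a majority partition

# One extra arbitrator vote at a third site means a site plus the
# arbitrator (4 of 7) has quorum, while a lone site (3 of 7) does not.
print(has_quorum(surviving_site + 1, total + 1))
print(has_quorum(surviving_site, total + 1))
```

This is exactly the failure event described above: a 3+3 split leaves two quorum-less halves, so some third vote (wherever it is integrated) is needed for automatic recovery.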
Re: [Pacemaker] Split Site 2-way clusters
On 2010-01-18 11:14, Andrew Beekhof wrote: > On Thu, Jan 14, 2010 at 11:44 PM, Miki Shapiro > wrote: >> Confused. >> >> >> >> I *am* running DRBD in dual-master mode > > /me cringes... this sounds to me like an impossibly dangerous idea. > Can someone from linbit comment on this please? Am I imagining this? Dual-Primary DRBD in a split site cluster? Really really bad idea. Anyone attempting this, please search the drbd-user archives for multiple discussions about this in the past. Then reconsider. Hope that makes it clear enough. Florian ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] DC election with downed node in 2-way cluster
On Thu, Jan 14, 2010 at 4:40 AM, Miki Shapiro wrote: >>> And the node really did power down? > Yes. 100% certain and positive. OFF. > >>> But the other node didn't notice?!? > Its resources (drbd master and the fence clone) did notice. > Its dc-election-mechanism did NOT notice (and the survivor didn't re-elect) > Its quorum-election mechanism did NOT notice (and the survivor still thinks > it has quorum). > > Logs attached. Hmmm. Not much to see there. crmd gets the membership event and then just sort of stops. Could you try again with debug turned on in openais.conf please? > > Keep in mind I'm relatively new to this. PEBKAC not entirely outside the > realm of the possible ;) Doesn't look like it, but you might want to try something a little more recent than 1.0.3. > Thanks! > > -Original Message- > From: Andrew Beekhof [mailto:and...@beekhof.net] > Sent: Wednesday, 13 January 2010 7:26 PM > To: pacemaker@oss.clusterlabs.org > Subject: Re: [Pacemaker] DC election with downed node in 2-way cluster > > On Wed, Jan 13, 2010 at 9:12 AM, Miki Shapiro > wrote: >> Halt = soft off - a natively issued poweroff command that shuts stuff down >> nicely, then powers the blade off. > > And the node really did power down? > But the other node didn't notice?!? That is insanely bad - looking > forward to those logs. > >> Logs I'll send tomorrow (our timezone is just wrapping up for the day). > > Yep, I'm actually an Aussie too... just not living there at the moment :-) > > ___ > Pacemaker mailing list > Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > __ > This email and any attachments may contain privileged and confidential > information and are intended for the named addressee only. If you have > received this e-mail in error, please notify the sender and delete > this e-mail immediately. Any confidentiality, privilege or copyright > is not waived or lost because this e-mail has been sent to you in > error. 
It is your responsibility to check this e-mail and any > attachments for viruses. No warranty is made that this material is > free from computer virus or any other defect or error. Any > loss/damage incurred by using this material is not the sender's > responsibility. The sender's entire liability will be limited to > resupplying the material. > __ > ___ > Pacemaker mailing list > Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
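For anyone following along at home: the debug output Andrew asks for is controlled by the logging section of corosync.conf (openais.conf on older stacks). A minimal sketch follows — option names as in corosync 1.x; adjust the logfile path to taste and restart corosync afterwards:

```
logging {
        to_logfile: yes
        logfile: /var/log/corosync.log
        to_syslog: yes
        timestamp: on
        debug: on
}
```

With `debug: on`, crmd and the TOTEM subsystem log the membership events needed to see why the surviving node never re-elected a DC.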
Re: [Pacemaker] Pre-Announce: End of 0.6 support is near
On Tue, Jan 12, 2010 at 3:55 PM, Emmanuel Lesouef wrote: > On Tue, 12 Jan 2010 14:56:31 +0100, > Michael Schwartzkopff wrote: > >> On Tuesday, 12 January 2010 14:48:12, Emmanuel Lesouef wrote: >> > Hi, >> > >> > We use a rather old (in fact, very old) combination : >> > >> > heartbeat 2.99 + openhpi 2.12 >> > >> > What do you suggest in order to upgrade to the latest version of >> > pacemaker ? >> > >> > Thanks. >> >> http://www.clusterlabs.org/wiki/Upgrade >> > > Thanks for your answer. I already saw : > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-upgrade.html > > In fact, my question wasn't about the upgrading process but more about > polling this list about caveats, advice, or best practices when dealing > with a rather old & uncommon configuration. Biggest caveat is the networking issue that makes pacemaker 1.0 wire-incompatible with pacemaker 0.6 (and heartbeat 2.1.x). So rolling upgrades are out and you'd need to look at one of the other upgrade strategies. ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Pre-Announce: End of 0.6 support is near
On Tue, Jan 12, 2010 at 2:48 PM, Emmanuel Lesouef wrote: > Hi, > > We use a rather old (in fact, very old) combination : > > heartbeat 2.99 + openhpi 2.12 > > What do you suggest in order to upgrade to the latest version of > pacemaker ? What version of pacemaker/crm though? "heartbeat 2.99" doesn't contain any of the crm bits that became pacemaker ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Split Site 2-way clusters
On Thu, Jan 14, 2010 at 11:44 PM, Miki Shapiro wrote: > Confused. > > > > I *am* running DRBD in dual-master mode /me cringes... this sounds to me like an impossibly dangerous idea. Can someone from linbit comment on this please? Am I imagining this? > (apologies, I should have mentioned > that earlier), and there will be both WAN clients as well as > local-to-datacenter-clients writing to both nodes on both ends. It’s safe to > assume the clients will know not of the split. > > > > In a WAN split I need to ensure that the node whose idea of drbd volume will > be kept once resync happens stays up, and node that’ll get blown away and > re-synced/overwritten becomes dead asap. Won't you _always_ lose some data in a WAN split though? AFAICS, what you're doing here is preventing "some" from being "lots". Is master/master really a requirement? > NodeX(Successfully) taking on data from clients while in > quorumless-freeze-still-providing-service, then discarding its hitherto > collected client data when realizing other node has quorum and discarding > own data isn’t good. Agreed - freeze isn't an option if you're doing master/master. > > To recap what I understood so far: > > 1. CRM Availability on the multicast channel drives DC election, but > DC election is irrelevant to us here. > > 2. CRM Availability on the multicast channel (rather than resource > failure) drive who-is-in-quorum-and-who-is-not decisions [not sure here.. > correct? correct > Or does resource failure drive quorum? ] quorum applies to node availability - resource failures have no impact (unless they lead to fencing which then leads to the node leaving the membership) > > 3. Steve to clarify what happens quorum-wise if 1/3 nodes sees both > others, but the other two only see the first (“broken triangle”), and > whether this behaviour may differ based on whether the first node (which is > different as it sees both others) happens to be the DC at the time or not. Try in a cluster of 3 VMs?
Just use iptables rules to simulate the broken links > > Given that anyone who goes about building a production cluster would want to > identify all likely failure modes and be able to predict how the cluster > behaves in each one, is there any user-targeted doco/rtfm material one could > read regarding how quorum establishment works in such scenarios? I don't think corosync has such a doc at the moment. > Setting up a 3-way with intermittent WAN links without getting a clear > understanding in advance of how the software will behave is … scary :) ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
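The membership ambiguity behind the "broken triangle" question can be illustrated with a toy model (this is not corosync's totem protocol, merely the observation that a single-ring membership needs every member to reach every other member). Given the pairs of nodes that can still communicate, the sketch computes the candidate memberships:

```python
from itertools import combinations

def possible_memberships(nodes, links):
    """Return the maximal node sets in which every pair can communicate.
    Toy model: a valid membership must be fully meshed."""
    links = {frozenset(link) for link in links}
    valid = [
        set(group)
        for size in range(len(nodes), 0, -1)
        for group in combinations(sorted(nodes), size)
        if all(frozenset(pair) in links for pair in combinations(group, 2))
    ]
    # Keep only sets not strictly contained in a larger valid set.
    return [s for s in valid if not any(s < t for t in valid)]

# "Broken triangle": A sees B and C, but the B<->C link is down.
print(possible_memberships({"A", "B", "C"}, [("A", "B"), ("A", "C")]))
```

In this case there are two equally valid two-node memberships ({A, B} and {A, C}), neither of which holds a majority together with the excluded node, which is exactly why the outcome can plausibly depend on timing and on which node happens to be DC when the links break.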
Re: [Pacemaker] Cluster group and name confusion
On Sun, Jan 17, 2010 at 7:07 PM, Hunny Bunny wrote: > Hello folkz, > I'm confused under which cluster group and name I should run the whole > cluster environment root/root or hacluster/hauser. > hacluster/hauser > > I have compiled from most recent sources Corosync/OpenAIS, Cluster Glue, > Resource Agents, Pacemaker, DRBD and OCFS2-Tools environment. > > This site http://www.clusterlabs.org/wiki/Install#From_Source > suggests to create > > groupadd -r hacluster > useradd -r -g hacluster -d /var/lib/heartbeat/cores/hacluster -s > /sbin/nologin -c "cluster user" hauser > > However, Corosync/OpenAIS which starts all Pacemaker CRM stuff runs as user > and group root > No it doesn't. It starts the _parent_ process as root. Some parts need to run as root so that they can do things like "add an ip address to the system" or "start apache" - things non-root users can't do. > in /etc/corosync/corosync.conf > > <- snipped --> > > service { > # Load the Pacemaker Cluster Resource Manager > name: pacemaker > ver: 0 > } > > aisexec { > user: root > group: root > } > > <- snipped --> > > DRBD, O2CB and OCFS2 start and run as user and group root > > So, should I now change to run all the cluster components as a root/root or > hacluster/haclient > No. > > Could you please clarify this cluster group/user confusion for me. > Did you try running it and looking at the "ps axf" output? > > Many thanks in advance, > > Alex > > > ___ > Pacemaker mailing list > Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] errors in corosync.log
On Sat, Jan 16, 2010 at 9:20 PM, Shravan Mishra wrote: > Hi Guys, > > I'm running the following version of pacemaker and corosync > corosync=1.1.1-1-2 > pacemaker=1.0.9-2-1 > > Everything had been running fine for quite some time now but then I > started seeing the following errors in the corosync logs, > > > = > Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid > digest... ignoring. > Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data > Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid > digest... ignoring. > Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data > Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid > digest... ignoring. > Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data > > > I can perform all the crm shell commands and what not but it's > troubling that the above is happening. > > My crm_mon output looks good. > > > I also checked the authkey and did md5sum on both; it's the same. > > Then I stopped corosync and regenerated the authkey with > corosync-keygen and copied it to the other machine but I still get > the above message in the corosync log. Are you sure there's not a third node somewhere broadcasting on that mcast and port combination? > > Is there anything other than the authkey that I should look into ?
> > > corosync.conf > > > > # Please read the corosync.conf.5 manual page > compatibility: whitetank > > totem { > version: 2 > token: 3000 > token_retransmits_before_loss_const: 10 > join: 60 > consensus: 1500 > vsftype: none > max_messages: 20 > clear_node_high_bit: yes > secauth: on > threads: 0 > rrp_mode: passive > > interface { > ringnumber: 0 > bindnetaddr: 192.168.2.0 > #mcastaddr: 226.94.1.1 > broadcast: yes > mcastport: 5405 > } > interface { > ringnumber: 1 > bindnetaddr: 172.20.20.0 > #mcastaddr: 226.94.1.1 > broadcast: yes > mcastport: 5405 > } > } > > > logging { > fileline: off > to_stderr: yes > to_logfile: yes > to_syslog: yes > logfile: /tmp/corosync.log > debug: off > timestamp: on > logger_subsys { > subsys: AMF > debug: off > } > } > > service { > name: pacemaker > ver: 0 > } > > aisexec { > user:root > group: root > } > > amf { > mode: disabled > } > > > === > > > Thanks > Shravan > > ___ > Pacemaker mailing list > Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > ___ Pacemaker mailing list Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
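For what it's worth, the failure mode behind the "invalid digest" messages can be sketched in a few lines. This is an illustration only — corosync's secauth uses its own HMAC construction, and the function below is a stand-in — but it shows why a packet from any host holding a different authkey (e.g. the hypothesized third node broadcasting on the same port) fails the receiver's digest check no matter how carefully the two real nodes' keys match:

```python
import hashlib
import hmac

def digest(key: bytes, packet: bytes) -> bytes:
    """Stand-in for corosync's secauth packet digest (not the real scheme)."""
    return hmac.new(key, packet, hashlib.sha1).digest()

key_a = b"128 random bytes from corosync-keygen, shared by nodes A and B"
key_c = b"a different key on some third, misconfigured host"
packet = b"totem payload"

sent = digest(key_c, packet)  # packet arrives from the stray host

# Receiver recomputes the digest under its own key: mismatch -> "invalid
# digest... ignoring", while traffic between A and B keeps verifying fine.
print(hmac.compare_digest(sent, digest(key_a, packet)))
print(hmac.compare_digest(digest(key_a, packet), digest(key_a, packet)))
```

So when both known nodes demonstrably share the same authkey (matching md5sums, freshly regenerated), a third sender on the same broadcast domain and port is the natural suspect, as Andrew suggests above.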