> Andrew Beekhof
> Mon, 13 Sep 2010 06:25:48 -0700
> 
> Looks like corosync can't talk to itself - ie. it never sees the
> multicast messages it sends out.
> This would result in the pacemaker errors you're seeing.
> 
> Almost always this is a firewall issue :-)
> Perhaps try disabling it completely?

Sorry for the late reply. I have been seeing a lot of strange behavior and
wanted to confirm it before wasting more of your time. I also tried some fixes
I had found on the net, and naturally had other things to do.

Even though I do not always reply immediately or at all, I am always very 
grateful for everyone's help.

I also upgraded all our libs to the latest versions, including those from the
Lenny backports.
I had initially thought it was a problem with our kernel, but that did not pan
out.
But I'll just stick with the newer versions for now.

The strangest thing is, the behavior is completely random.

Sometimes it just works.
Sometimes it just dies right after starting (could be the race condition
mentioned below).
Sometimes I get these:
 crmd: [3086]: WARN: lrm_signon: can not initiate connection
 crmd: [3086]: WARN: do_lrm_control: Failed to sign on to the LRM 2 (30 max) times
 crmd: [3086]: info: ais_dispatch: Membership 92: quorum still lost
Sometimes I get these:
 crmd: [3067]: info: crm_timer_popped: Wait Timer (I_NULL) just popped!
 crmd: [3067]: info: do_cib_control: Could not connect to the CIB service: connection failed
 crmd: [3067]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
Or these:
 crmd: [2649]: notice: Not currently connected.
 crmd: [2649]: ERROR: te_connect_stonith: Sign-in failed: triggered a retry
 crmd: [2649]: info: te_connect_stonith: Attempting connection to fencing daemon...
 crmd: [2649]: ERROR: stonithd_signon: Can't initiate connection to stonithd
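
Read side by side, the three sets of messages above have one thing in common:
each time, crmd fails to reach another local daemon (the LRM, the CIB, or
stonithd). A rough sketch for tallying that from a log excerpt follows. The
function-to-daemon mapping and the names PEER_BY_FUNCTION and
tally_failed_peers are illustrative assumptions based on the messages, not
anything pacemaker provides.

```python
import re
from collections import Counter

# Assumed mapping from the crmd function named in a log line to the daemon
# it was trying to reach. This is a reading of the messages, nothing official.
PEER_BY_FUNCTION = {
    "lrm_signon": "lrmd",
    "do_lrm_control": "lrmd",
    "do_cib_control": "cib",
    "te_connect_stonith": "stonithd",
    "stonithd_signon": "stonithd",
}

# Matches e.g. "crmd: [3086]: WARN: lrm_signon: can not initiate connection"
LINE_RE = re.compile(r"crmd: \[(\d+)\]: (\w+): (\w+): (.*)")


def tally_failed_peers(lines):
    """Count WARN/ERROR lines per daemon that crmd failed to reach."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a "level: function: message" crmd line
        _pid, level, func, _msg = match.groups()
        if level in ("WARN", "ERROR") and func in PEER_BY_FUNCTION:
            counts[PEER_BY_FUNCTION[func]] += 1
    return counts
```

Feeding it the lines from several boots makes it easier to see whether the
failing daemon really changes from boot to boot or whether one of them
dominates.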

And I get these all the time.
(All of the above were actually taken from one machine across successive
reboots, but the same applies to all of them.)

Strangest of all, if I stop (and kill) corosync and restart it via init.d 
manually, it works fine.
Even without any of the changes mentioned below.
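
Because a manual restart works where the boot-time start does not, the start
order of the init scripts is one thing worth ruling out. A minimal sketch for
listing that order from the SysV-style rc links, assuming plain lexical
ordering of the S?? links (no dependency-based boot). The helper names
start_order, starts_before and start_order_in are made up for illustration.

```python
import os
import re

# SysV rc links look like "S98corosync": S = start, two-digit sequence, name.
START_LINK_RE = re.compile(r"^S(\d\d)(.+)$")


def start_order(link_names):
    """Return service names in the order their S links would be run."""
    parsed = []
    for name in link_names:
        match = START_LINK_RE.match(name)
        if match:
            parsed.append((int(match.group(1)), match.group(2)))
    # Sort by sequence number first, then by service name.
    return [service for _seq, service in sorted(parsed)]


def starts_before(link_names, first, second):
    """True if service `first` is started before service `second`."""
    order = start_order(link_names)
    return order.index(first) < order.index(second)


def start_order_in(rc_dir):
    """Convenience wrapper: read the link names from a directory such as /etc/rcS.d."""
    return start_order(os.listdir(rc_dir))
```

For example, starts_before(os.listdir("/etc/rc2.d"), "networking", "corosync")
would show whether the network comes up before corosync in that runlevel; the
service names there are only examples and depend on the local setup.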

I have tried a lot of (crazy) stuff:
* different network setups
* different hardware
* created resolv.conf
* inet config
* ntp corrections
* disabled firehol (chmod -x on the script)
* disabled bind9 (same)
* disabled drbd from the runlevels (not the script as those above)
* changed runlevels as per this post (moved mine to rcS.d/S98corosync):
  http://oss.clusterlabs.org/pipermail/pacemaker/2010-February/005010.html
* upgraded corosync to 1.2.1-2 because of this race condition bug and its
subsequent fix:
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596694
* removed stuff from the root tag in the cib.xml that I thought might be a problem
  I use a template cib.xml, and I removed everything whose absence crm_verify
did not complain about.
* sacrificed coffee to the IT gods.
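
On the multicast theory: one check that does not depend on corosync at all is
to send a single multicast datagram and see whether the same host receives it.
A minimal sketch, assuming IPv4 and the default interface. The group address
and port below are placeholders (the real mcastaddr comes from corosync.conf,
and the port should be one corosync is not using), and the function name
multicast_loopback_ok is made up for illustration.

```python
import socket
import struct

# Placeholder values: take the group from the totem section of corosync.conf,
# and pick a port that the running corosync does NOT use.
GROUP = "226.94.1.1"
PORT = 5499


def multicast_loopback_ok(group=GROUP, port=PORT, timeout=2.0):
    """Send one datagram to a multicast group; report whether this host sees it."""
    payload = b"mcast-self-test"
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        recv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        recv.bind(("", port))
        # Join the group on the default interface (INADDR_ANY).
        mreq = struct.pack("4s4s", socket.inet_aton(group),
                           socket.inet_aton("0.0.0.0"))
        recv.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        recv.settimeout(timeout)

        # Ask for our own packets to be looped back; keep them on the local segment.
        send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
        send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        send.sendto(payload, (group, port))

        data, _addr = recv.recvfrom(1024)
        return data == payload
    except OSError:
        # Covers the receive timeout as well as errors such as
        # "network unreachable" or "no such device".
        return False
    finally:
        recv.close()
        send.close()
```

A False result here is only a hint, not proof: it can also mean the default
interface is not the one corosync binds to. Running it once at boot time and
once after a manual restart would at least show whether the two situations
differ.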

All on different machines, but all with nearly the same outcome. The changes
were not all made on the same machines, so that I could confirm a fix on the
others had I ever found one.
I also have a proof-of-concept machine with just the net-install plus the
latest kernel from backports and the HA setup, and it shows similar behavior.

Summary:
Debian Lenny 64bit
linux-image-2.6.33.3

Packages:
(default)
corosync 1.2.1-1~bpo50+1
libcorosync4 1.2.1-1~bpo50+1
(updated)
corosync 1.2.1-2
libcorosync4 1.2.1-2

cluster-glue 1.0.6-1~bpo50+1
libcluster-glue 1.0.6-1~bpo50+1
pacemaker 1.0.9.1+hg15626-1~bpo50+1
libheartbeat2 1:3.0.3-2~bpo50+1
drbd8-utils 2:8.3.7-1~bpo50+1

Is there anything else you need?

Thanks again.

Frank

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
